1 of 28

2 of 28

Overview - 3D content creation (with text)

  • Recent methods – DreamFusion, Magic3D, Score Jacobian Chaining, … – synthesize high-quality 3D objects, but require a lengthy per-prompt optimization (15+ min).
  • Creating 3D content is valuable but currently difficult
  • Generating high-quality assets from text descriptions makes creating usable 3D content easier

a squirrel wearing an elegant ballgown playing the saxophone

a brightly colored mushroom growing on a log

DreamFusion

Magic3D

3 of 28

Overview - the issue

  • Users repeatedly iterate between engineering the prompt (and other parameters) and rendering the results
  • Waiting 15+ min for each design iteration makes the process halting and time-consuming.

4 of 28

Overview - our solution

  • We solve this by optimizing a single, amortized model on many prompts.
  • We render unseen prompts in < 1 sec. Prior methods took 15+ minutes.

5 of 28

Overview - benefits

  • Benefits of:
    • Generalization
    • Interpolating between prompts
    • Reducing training time
    • Amortizing over other types of info

6 of 28

Using our method: ATT3D

  • Want to generate 3D objects (from text)
  • Choose 3D object representation – here Neural Radiance Fields (NeRFs)
  • We use spatially varying features - here an Instant NGP
  • Want separate output for each text-prompt
  • So, “modulate” the parameters with the text
  • We use a hypernetwork on the text embedding
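The last two bullets can be sketched in a few lines. This is a minimal illustration with made-up sizes and a single linear hypernetwork layer; the actual mapping network in ATT3D is larger and modulates Instant NGP features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
TEXT_DIM = 8   # text-embedding size (e.g., from a frozen text encoder)
FEAT_DIM = 4   # per-point spatial feature size (e.g., from an Instant NGP grid)

# Hypernetwork: maps a text embedding to a modulation matrix applied to the
# spatial features, so one set of shared weights serves every prompt.
W_hyper = rng.normal(0, 0.1, size=(TEXT_DIM, FEAT_DIM * FEAT_DIM))

def modulate(text_emb, spatial_feat):
    """Apply a text-conditioned linear modulation to spatial features."""
    mod = (text_emb @ W_hyper).reshape(FEAT_DIM, FEAT_DIM)
    return spatial_feat @ mod

text_emb = rng.normal(size=TEXT_DIM)      # one prompt's embedding
spatial_feat = rng.normal(size=FEAT_DIM)  # features at one 3D point
out = modulate(text_emb, spatial_feat)    # per-prompt modulated features
```

Different text embeddings yield different modulations of the same spatial features, which is how a single model produces a separate output per prompt.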

7 of 28

Training with our method

  • Use SDS loss from DreamFusion (on multiple prompts):
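The loss itself appeared as a figure on the original slide; for reference, a sketch of DreamFusion's Score Distillation Sampling (SDS) gradient in that paper's notation, with an added expectation over the prompt set y (this rendering is reconstructed, not copied from the slide):

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{y,\,t,\,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(z_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
  \qquad z_t = \alpha_t\, x + \sigma_t\, \epsilon, \quad x = g(\theta)
```

Here g renders the NeRF with parameters θ (produced from the embedding of prompt y), ε̂_φ is the frozen 2D diffusion model's noise prediction, and w(t) is a timestep weighting.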

8 of 28

Benefit: Reduce compute time to train on a set of prompts

  • Amortization allows us to train a single model, producing various objects.
  • Single-prompt training, used in DreamFusion, trains a separate model for each prompt.
  • Let’s compare the results:
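Before comparing, the structural difference can be shown with a toy loop. Everything below is a hypothetical stand-in (the "gradient" is a dummy; the real method conditions an Instant NGP NeRF on text embeddings and gets gradients from a frozen diffusion model); it only illustrates that amortized training samples a prompt per step and updates one shared model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes and targets, for illustration only.
N_PROMPTS, EMB_DIM, PARAM_DIM = 4, 4, 6
prompt_embs = np.eye(N_PROMPTS)                    # stand-in text embeddings
targets = rng.normal(size=(N_PROMPTS, PARAM_DIM))  # per-prompt "ideal" params

# One shared weight matrix maps a text embedding to NeRF parameters.
W = np.zeros((EMB_DIM, PARAM_DIM))

def toy_sds_grad(params, p):
    # Placeholder for the SDS gradient, which in the real method comes from
    # a frozen 2D diffusion model applied to rendered views.
    return params - targets[p]

# Amortized loop: each step samples a prompt and updates the SHARED weights.
for step in range(2000):
    p = rng.integers(N_PROMPTS)
    emb = prompt_embs[p]
    params = emb @ W
    grad_params = toy_sds_grad(params, p)
    W -= 0.05 * np.outer(emb, grad_params)  # chain rule through params = emb @ W

# After training, every prompt's parameters come from the single shared model.
errs = [np.abs(prompt_embs[p] @ W - targets[p]).max() for p in range(N_PROMPTS)]
```

Single-prompt training would instead run a separate loop with a separate parameter vector for each of the N_PROMPTS prompts.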

Amortized Training

Single-prompt Training

9 of 28

Benefit: Reduce compute time to train on a set of prompts

  • Amortization (blue) allows higher quality than single-prompt training (red) for almost all compute budgets

11 of 28

Benefit: Reduce compute time to train on a set of prompts

  • We scale to the extended DreamFusion 411 prompt set with identical model size and compute budget

  • We show examples where amortization re-uses components, allowing for compute savings

12 of 28

Do we have any generalization?

13 of 28

Benefit: Generalize to new prompts

  • We generalize to unseen test prompts with no additional training - shown along the diagonal in red
  • Per-prompt optimization has no testing protocol, so we show its initialization to align compute budgets

Amortized Optimization

Per-prompt Optimization

14 of 28

Benefit: Generalize to new prompts

  • Amortized training (blue) achieves higher training and testing quality than single-prompt training (red) for almost all compute budgets
  • Single-prompt training has no zero-shot testing protocol, so we show performance at random initialization
  • Gains grow for compositional (middle) and larger (right) prompt sets.
  • The generalization gap is small when training on 50% of prompts, and even at a 12.5% split, quality on unseen prompts exceeds per-prompt optimization on seen prompts

15 of 28

Benefit: Generalize to new prompts

  • A single model trained on the animal prompts generalizes to unseen prompts without optimization

Amortized 50% split, unseen prompts

Amortized 12.5% split, unseen prompts

Per-prompt optimization

16 of 28

Possible Benefit: Consistent output

  • Amortized optimization may create objects matching prompts more consistently

“… holding a blue balloon”

Amortized optimization

Per-prompt optimization

17 of 28

Benefit: Finetune on prompts

  • Amortized optimization recovers the correct balloon, unlike per-prompt optimization.
  • We can finetune this result with Magic3D’s second optimization stage

Per-prompt

Amortized

Amortized + Magic3D

Various strategies on “a pig wearing medieval armor holding a blue balloon”

18 of 28

Benefit: Finetune on prompts

  • Or, use the amortized model as an initialization and continue finetuning on unseen prompts
  • This outperforms the per-prompt strategy of optimizing from a random initialization

19 of 28

Benefit: Interpolate between prompts

  • Our method allows interpolations, unlike single-prompt training.
  • We synthesize a continuum of novel assets by interpolating embeddings.
  • Here, we train on 3 prompts and zero-shot generalize to interpolants.
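The interpolation itself is simple: blend the prompt embeddings and feed the result through the same amortized model. A minimal sketch with placeholder embeddings (the vectors and sizes below are made up; only the blending is the point):

```python
import numpy as np

def interpolate_embeddings(emb_a, emb_b, alpha):
    """Linearly blend two prompt embeddings; alpha=0 gives emb_a, alpha=1 gives emb_b."""
    return (1.0 - alpha) * emb_a + alpha * emb_b

emb_summer = np.array([1.0, 0.0, 0.5])  # placeholder embedding for one prompt
emb_fall   = np.array([0.0, 1.0, 0.5])  # placeholder embedding for another
frames = [interpolate_embeddings(emb_summer, emb_fall, a)
          for a in np.linspace(0.0, 1.0, 5)]  # a continuum of conditioning vectors
```

Rendering the model at each interpolated embedding yields the continuum of novel assets shown in the videos.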

20 of 28

Benefit: Interpolate between prompts

  • Some prompts give reasonable results at all interpolants, but others could be improved.
  • Can we augment training to also amortize over interpolants?

21 of 28

Benefit: Amortize over other information

  • We amortize over different training strategies to produce different types of results.
  • With no interpolation during training, interpolants can naively dissolve between prompts.

No Train Interpolation

Guidance Interpolation

22 of 28

Benefit: Prompt interpolation for novel assets & animations

“... in the fall with dying leaves”

“... full of leaves in the summer”

“... with flowering cherry blossoms”

“a baby dragon”

“a green dragon”

“a red convertible car with the top down”

“a completely destroyed car”

“... gnarly, old, leafless with many branches”

“a jagged rock”

“a mossy rock”

“...cottage with thatched roof”

“...house in tudor style”

“...dress made of fruit…”

“...dress made of garbage bags…”

23 of 28

Future Directions & Limitations

24 of 28

Conclusion

  • We presented a method for amortized optimization of text-to-3D models: ATT3D
  • Our method trains a single, amortized model on various text-prompts.
  • Benefits of:
    • Real-time asset generation via generalizing to prompts
    • User-guided generation via interpolating between prompts & amortizing over other info
    • Cost savings via reducing training time
  • A promising avenue towards general and fast text-to-3D generation

25 of 28

Jonathan Lorraine

Kevin Xie

Xiaohui Zeng

Chen-Hsuan Lin

Towaki Takikawa

Tsung-Yi Lin

Ming-Yu Liu

Sanja Fidler

James Lucas

Nicholas Sharp

26 of 28

Citations

  • Poole, Ben, et al. "DreamFusion: Text-to-3D using 2D Diffusion." arXiv preprint arXiv:2209.14988 (2022).
  • Lin, Chen-Hsuan, et al. "Magic3D: High-Resolution Text-to-3D Content Creation." arXiv preprint arXiv:2211.10440 (2022).
  • Wang, Haochen, et al. "Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation." arXiv preprint arXiv:2212.00774 (2022).

27 of 28

Extra slides

28 of 28