1 of 50

New Pre-training Paradigms from an Inference-First Perspective

Jiaming Song

Luma AI

2 of 50

Video Models’ 1-Year Birthday

3 of 50

Native Multi-Modal Generation: The current “hype”

From:

  • Vision Language Models (Text + Image in, Text out)
  • Diffusion Models (Text + Image in, Image out)

To:

  • Interleaved Models (Text + Image in, Text + Image out)

4 of 50

A lot of papers in the past year

  • Text: Discrete AR / Image: Discrete AR
  • Text: Discrete AR / Image: Discrete Diffusion
  • Text: Discrete Diffusion / Image: Discrete Diffusion
  • Text: Discrete AR / Image: Continuous Diffusion

All based on combinations of Discrete AR / Discrete Diffusion / Continuous Diffusion, etc.

5 of 50

Why not stick to next-token prediction?

6 of 50

Discrete tokens have a quality issue

[Figure: original photo vs. its reconstruction from discrete tokens. It looks quite different up close!]

*You cannot use the reconstruction to tell who the person is, even for understanding purposes

7 of 50

Discrete tokens have a quality issue

Discrete tokens have much worse reconstruction than continuous ones

https://arxiv.org/abs/2408.06072

https://arxiv.org/abs/2409.18869

[Figure: reconstruction comparison between discrete and continuous tokenizers.]

8 of 50

Fundamental flaw of discrete tokens

Discrete tokens have to compress a lot more for the same sequence length

Continuous tokens have much higher quality at the same sequence length!

Discrete tokens (each token is a codebook index: log2(32768) = 15 bits):

Bit compression = (4 × 8 × 8 sequence compression) × (3 channels) × (8-bit color) / log2(32768) = 6144 / 15 = 409.6

Continuous tokens (each token is 8 latent channels in bfloat16: 16 × 8 = 128 bits):

Bit compression = (4 × 8 × 8) × 3 × 8 / (16 × 8) = 6144 / 128 = 48
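A minimal sketch of this arithmetic in Python, using the slide’s numbers (4×8×8 sequence compression, 3 channels of 8-bit color, a 32768-entry codebook, 8 latent channels in bfloat16):

```python
import math

# Raw bits represented by one token's worth of pixels:
raw_bits = (4 * 8 * 8) * 3 * 8        # sequence compression x channels x 8-bit color = 6144

# Discrete: each token is an index into a 32768-entry codebook.
bits_per_discrete_token = math.log2(32768)    # 15 bits
print(raw_bits / bits_per_discrete_token)     # 409.6x bit compression

# Continuous: each token is 8 latent channels stored in bfloat16.
bits_per_continuous_token = 16 * 8            # 128 bits
print(raw_bits / bits_per_continuous_token)   # 48.0x bit compression
```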

9 of 50

Continuous tokens have a speed issue

Diffusion requires many timesteps to converge

BAGEL: MoT (Mixture-of-Transformer-Experts) with discrete + continuous tokens

10 of 50

Continuous tokens have a speed issue

Discrete tokens only require 1 pass of the transformer per token.

Continuous tokens require many passes of the transformer: one per refinement step.

[Figure: an interleaved sequence of discrete tokens (D1, D2) and continuous tokens (C1, C2). The continuous signal can be image / video / sound / actions, etc. While the sequence looks like D1, C1, D2, C2, the compute on the hardware is really D1, C1, C1, C1, C1, …: each continuous token is re-processed at every refinement step.]
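A toy count of transformer forward passes for the figure above; the 50-step diffusion count and the helper function are illustrative assumptions, not any specific model’s numbers:

```python
def forward_passes(sequence, diffusion_steps=50):
    # Discrete tokens cost one transformer pass each; continuous tokens
    # cost one pass per diffusion refinement step.
    passes = 0
    for token_type in sequence:
        if token_type == "discrete":
            passes += 1
        else:  # "continuous"
            passes += diffusion_steps
    return passes

# The "D1 C1 D2 C2" sequence from the figure:
print(forward_passes(["discrete", "continuous", "discrete", "continuous"]))  # 102
```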

11 of 50

The algorithms are dominated by AR and diffusion…

But none are perfect!

12 of 50

The trilemma of continuous generative models

[Figure: trilemma triangle with corners: training stability, high-quality samples, efficient inference. Each family sits on the edge between the two properties it achieves.]

  • GANs, diffusion distillation: high-quality samples + efficient inference, but unstable training
  • Diffusion models: training stability + high-quality samples, but inefficient inference
  • VAEs, normalizing flows: training stability + efficient inference, but lower sample quality

Need something that achieves all three!

13 of 50

The algorithms are dominated by AR and diffusion…

But none are perfect!

Is there anything that would break the ceiling of the two?

The answer is Yes!

14 of 50

Outline

  1. Two axes of inference scaling – sequence and refinement.
  2. “Inference-first perspective” for algorithms that scale.
  3. Why DDIM (and by extension, diffusion) is “sub-optimal”.
  4. New algorithms and insights from inference-first perspective.

15 of 50

How can we scale at inference-time?

  • Increase the number of tokens
  • Don’t increase the number of tokens

16 of 50

Inference-Time Scaling in Sequence Length

Increases the number of tokens

    • LLM Chain-of-Thought (CoT)
    • CoT with reasoning data
    • RL (DeepSeek-R1)

17 of 50

Inference-time Scaling in Refinement Steps

Does not increase the number of tokens

    • Diffusion models / Flow Matching

[Figure: progressive diffusion refinement for the prompt “puppy in space”.]

18 of 50

Categorizing existing algorithms

  • Does not scale in either
    • VAE, GAN, Normalizing Flows
  • Scale in sequence length, but not in refinement steps
    • GPT, PixelCNN, VAR, MaskGIT
  • Scale in refinement steps, not in sequence length
    • Diffusion models, energy-based models, consistency models
  • Scale in both (cont’d)

19 of 50

A lot of algorithms that scale in both axes

  • Sequence length in outer loop, refinement steps in inner loop.
    • Basically, how most “Autoregressive + Diffusion” models are done
    • MAR
    • Diffusion Forcing
    • Discrete LLMs

20 of 50

A lot of algorithms that scale in both axes

  • Scale refinement in outer loop, sequence length in inner loop.
    • Autoregressive distribution smoothing

21 of 50

Scaling efficiency in inference algorithm

Of course, just being able to scale up is not enough!

We also have to scale efficiently!

Infinite monkeys “can” type Shakespeare

AlphaGo was enabled by searching more efficiently

22 of 50

Three positions

1. The right inference algorithm should scale in both axes.

2. Assuming that the model has enough capacity (under the universal approximation theorem), it should use as few steps as possible.

3. Analyze the inference algorithm before the training algorithm!

(Applies to both continuous and discrete cases, but today we will focus on the continuous case)

23 of 50

Application to Continuous Diffusion

  1. The right inference algorithm should scale in both axes. (✓)

    • Diffusion does scale in the refinement axis.

  2. Assuming that the model has enough capacity (under the universal approximation theorem), it should use as few steps as possible. (✗)

    • DDIM requires multiple steps even when the model has infinite capacity!

24 of 50

Application to Continuous Diffusion

What do we want from the “right” inference algorithm?

There exists a solution for the model such that both hold:

  1. The inference algorithm generates the right distribution in N steps (scales correctly)
  2. The inference algorithm generates the right distribution in 1 step (scales efficiently)

Unfortunately, DDIM is NOT the “right” inference algorithm!

25 of 50

DDIM and the Inference Capacity Issue

  • The one-step DDIM update from time t to s is x_s = (α_s − σ_s α_t / σ_t) · x̂_θ(x_t, t) + (σ_s / σ_t) · x_t.
  • The network x̂_θ sees only (x_t, t), never the target time s: the dependence on s enters only through fixed affine coefficients.
  • So even with infinite model capacity, one large DDIM step cannot represent the exact map from x_t to x_s.

26 of 50

The Fix

  • Condition the model on both timesteps, f_θ(x_t, s, t), so a single step maps x_t directly to time s.

27 of 50

Diffusion Models and Flow Matching

  • NOT optimal in utilizing network capacity.

  • Learn an ODE, requiring MANY steps for accurate simulation
  • Ideal case: optimal use of model capacity / efficient inference-time scaling

28 of 50

Application to Continuous Diffusion

DDIM is NOT the “right” inference algorithm because the model takes only a single timestep as input!

We can fix it by asking the model to take 2 timesteps (a sketch follows below)!

  • Something new in the literature, known as “flow maps”
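A minimal sketch of the contrast, with the one-step DDIM update written in the standard x̂_0 parameterization; `x0_model`, `flow_map_model`, `alpha`, and `sigma` are placeholder names, not code from the paper:

```python
def ddim_step(x_t, s, t, x0_model, alpha, sigma):
    # One DDIM step from time t to time s. The network sees only (x_t, t);
    # the target time s enters only through fixed affine coefficients, so a
    # single large step cannot be exact even with infinite model capacity.
    x0_hat = x0_model(x_t, t)
    return (alpha(s) - sigma(s) * alpha(t) / sigma(t)) * x0_hat \
        + (sigma(s) / sigma(t)) * x_t


def flow_map_step(x_t, s, t, flow_map_model):
    # A flow-map step: the network takes BOTH timesteps, so a single step
    # from t to s can in principle represent the exact solution map.
    return flow_map_model(x_t, s, t)
```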

29 of 50

Analyze inference before training

Once the inference algorithm is decided, the model can be trained with many different approaches!

30 of 50

Inductive Moment Matching

  • Not dependent on denoising score matching / flow matching
  • Not dependent on score-based stochastic differential equations
  • The solution does not have to be connected to the probability flow ODE!

31 of 50

Intuition: “consistency” in distributions

For timesteps s < r < t, the following two distributions should be close:

  • Sample x_t, then make a one-step prediction from x_t to x_s.
  • Sample x_r, then make a one-step prediction from x_r to x_s.

32 of 50

Intuition: “consistency” in distributions

We can simply use Maximum Mean Discrepancy (MMD):

  1. Like a GAN, MMD has a “discriminator”
  2. Unlike a GAN,
    1. MMD uses a special family of discriminators drawn from an RKHS (reproducing kernel Hilbert space).
    2. No need to “optimize” the discriminator, so training is stable! (A sketch of the loss follows.)
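A minimal sketch of an MMD loss with an RBF kernel; IMM’s actual kernel choice, weighting, and stop-gradient details are in the paper, so treat this only as the shape of the objective:

```python
import torch

def mmd2(x, y, bandwidth=1.0):
    # Biased estimate of squared MMD between two groups of particles.
    # x: one-step predictions to time s starting from x_t, shape (particles, dim)
    # y: one-step predictions to time s starting from x_r, shape (particles, dim)
    def rbf(a, b):
        # RBF kernel: the RKHS "discriminator" family, nothing to optimize.
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * bandwidth ** 2))

    return rbf(x, x).mean() + rbf(y, y).mean() - 2 * rbf(x, y).mean()
```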

33 of 50

Advantages of IMM

  1. Single stage training, single objective function
  2. Generalizes consistency models (when comparing distributions with 1 sample)
  3. Quite stable to train
  4. Reaches SOTA few-step generation

34 of 50

Stable Training

  • Consistency model is a 1-particle special case

  • Stable training as long as >4 particles

35 of 50

Image Generation

  • Better than DiT/SiT
  • Outperforms VAR-d20 (600M params)

  • ImageNet-256×256 16-step FID: 1.90
    • Outperforms VAR-d30 (2B params)
  • CIFAR-10 2-step FID: 1.98

36 of 50

Scaling Property

37 of 50

Advancing Efficiency / Quality Frontier

38 of 50

The trilemma of continuous generative models

[Figure: the same trilemma triangle: training stability, high-quality samples, efficient inference, with GANs/diffusion distillation, diffusion models, and VAEs/normalizing flows each on an edge.]

IMM (and possibly other flow map methods) sits in the middle, achieving all three.

39 of 50

Applications to Discrete Diffusion

Consider Masked Diffusion, a performant variant of discrete diffusion

Shi et al., Simplified and Generalized Masked Diffusion for Discrete Data

40 of 50

Applications to Discrete Diffusion

In masked diffusion, a token’s value changes only when the input at that position is the [MASK] token.

Suppose seqlen = N, and we want to sample in L << N steps:

  • Then some step must sample at least two tokens at once (at least ⌈N/L⌉ by pigeonhole; see the sketch below)!

Shi et al., Simplified and Generalized Masked Diffusion for Discrete Data
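A one-line pigeonhole check; N and L are illustrative values:

```python
import math

# With seqlen N sampled in L steps, some step must decode at least
# ceil(N / L) tokens jointly.
N, L = 1024, 64
print(math.ceil(N / L))  # 16 tokens sampled together in at least one step
```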

41 of 50

Applications to Discrete Diffusion

Does the BERT-style model have “enough capacity”?

Suppose we try to predict:

The list of poker hands that consist of two English words are: [MASK] [MASK]

  • The valid responses include: “high card”, “two pair”, etc…
  • However, a BERT-style model samples each [MASK] independently, so it is also possible to generate “high pair” or “two card” (see the sketch after this list)!
  • This is not an issue with AR models, because words are generated one at a time.
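A toy sketch of this failure mode; the two-hand vocabulary and per-position marginals are illustrative assumptions:

```python
import random

# Valid two-word poker hands, and the per-position marginals a BERT-style
# model would learn from them.
valid_hands = [("high", "card"), ("two", "pair")]
first_marginal = [a for a, _ in valid_hands]    # ["high", "two"]
second_marginal = [b for _, b in valid_hands]   # ["card", "pair"]

# Filling each [MASK] independently can recombine the marginals into
# invalid hands like ("high", "pair") or ("two", "card").
sample = (random.choice(first_marginal), random.choice(second_marginal))
print(sample)
```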

42 of 50

Applications to Discrete Diffusion

From the inference-first perspective:

Masked discrete diffusion might have capacity issues when sampling in L << N steps with a BERT-style model, regardless of how it is trained!

43 of 50

Takeaway

Analyze the inference algorithm before the training algorithm!

  • Continuous case: better alternatives to diffusion models
  • Discrete case: limitations of the BERT-style diffusion LLM

44 of 50

Inductive Moment Matching: https://github.com/lumalabs/imm

Inference-first position paper: https://arxiv.org/abs/2503.07154

45 of 50

Join us: https://lumalabs.ai/join

Happy hour @ Barstool: https://lu.ma/5s0o2hlh

46 of 50

Learning to Take Large Strides in Time


47 of 50

Generalized Interpolant


48 of 50

Model and Sampling

  • Model: f_θ(x_t, s, t), a one-step map from time t to time s
  • Want: the one-step sample follows the target marginal at time s
  • Naïve objective: directly match the distribution of one-step samples against the target marginal
  • 2-step sample: x_r = f_θ(x_t, r, t), then x_s = f_θ(x_r, s, r)

49 of 50

Inductive Moment Matching


50 of 50

Inductive Learning Algorithm

[Figure: the inductive learning algorithm, illustrated with 2 particles.]