1 of 50

New Pre-training Paradigms from an Inference-First Perspective

Jiaming Song

Luma AI

2 of 50

Video Models’ 1-Year Birthday

3 of 50

Native Multi-Modal Generation: The current “hype”

From:

  • Vision Language Models (Text + Image in, Text out)
  • Diffusion Models (Text + Image in, Image out)

To:

  • Interleaved Models (Text + Image in, Text + Image out)

4 of 50

A lot of papers in the past year

  • Text: Discrete AR / Image: Discrete AR
  • Text: Discrete AR / Image: Discrete Diffusion
  • Text: Discrete Diffusion / Image: Discrete Diffusion
  • Text: Discrete AR / Image: Continuous Diffusion

All based on combinations of Discrete AR / Discrete Diffusion / Continuous Diffusion, etc.

5 of 50

Why not stick to next-token prediction?

6 of 50

Discrete tokens have a quality issue

[Figure: original photo vs. its reconstruction from discrete tokens. It looks quite different up close!]

*You cannot use the reconstruction to tell who the person is, even for understanding purposes

7 of 50

Discrete tokens have a quality issue

Discrete tokens have much worse reconstruction than continuous ones

https://arxiv.org/abs/2408.06072

https://arxiv.org/abs/2409.18869

[Figure: reconstruction comparison between discrete and continuous tokenizers.]

8 of 50

Fundamental flaw of discrete tokens

Discrete tokens have to compress a lot more for the same sequence length

Continuous tokens have much higher quality at the same sequence length!

Discrete tokens (each token is a codebook index: log2(32768) = 15 bits):

Bit compression = (4 × 8 × 8 sequence compression) × (3 channels) × (8-bit color) / log2(32768) = 6144 / 15 = 409.6

Continuous tokens (each token is 8 latent channels in bfloat16: 16 × 8 = 128 bits):

Bit compression = (4 × 8 × 8) × 3 × 8 / (16 × 8) = 6144 / 128 = 48
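A minimal sketch of this arithmetic in Python, using the slide’s numbers (4×8×8 sequence compression, 3 channels of 8-bit color, a 32768-entry codebook, 8 latent channels in bfloat16):

```python
import math

# Raw bits represented by one token's worth of pixels:
raw_bits = (4 * 8 * 8) * 3 * 8        # sequence compression x channels x 8-bit color = 6144

# Discrete: each token is an index into a 32768-entry codebook.
bits_per_discrete_token = math.log2(32768)    # 15 bits
print(raw_bits / bits_per_discrete_token)     # 409.6x bit compression

# Continuous: each token is 8 latent channels stored in bfloat16.
bits_per_continuous_token = 16 * 8            # 128 bits
print(raw_bits / bits_per_continuous_token)   # 48.0x bit compression
```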

9 of 50

Continuous tokens have a speed issue

Diffusion requires many timesteps to converge

BAGEL: MoT (Mixture-of-Transformer-Experts) with discrete + continuous tokens

10 of 50

Continuous tokens have a speed issue

Discrete tokens only require 1 pass of the transformer per token.

Continuous tokens require many passes of the transformer: one per refinement step.

[Figure: an interleaved sequence of discrete tokens (D1, D2) and continuous tokens (C1, C2). The continuous signal can be image / video / sound / actions, etc. While the sequence looks like D1, C1, D2, C2, the compute on the hardware is really D1, C1, C1, C1, C1, …: each continuous token is re-processed at every refinement step.]
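A toy count of transformer forward passes for the figure above; the 50-step diffusion count and the helper function are illustrative assumptions, not any specific model’s numbers:

```python
def forward_passes(sequence, diffusion_steps=50):
    # Discrete tokens cost one transformer pass each; continuous tokens
    # cost one pass per diffusion refinement step.
    passes = 0
    for token_type in sequence:
        if token_type == "discrete":
            passes += 1
        else:  # "continuous"
            passes += diffusion_steps
    return passes

# The "D1 C1 D2 C2" sequence from the figure:
print(forward_passes(["discrete", "continuous", "discrete", "continuous"]))  # 102
```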

11 of 50

The algorithms are dominated by AR and diffusion…

But none are perfect!

12 of 50

The trilemma of continuous generative models

[Figure: trilemma triangle with corners: training stability, high-quality samples, efficient inference. Each family sits on the edge between the two properties it achieves.]

  • GANs, diffusion distillation: high-quality samples + efficient inference, but unstable training
  • Diffusion models: training stability + high-quality samples, but inefficient inference
  • VAEs, normalizing flows: training stability + efficient inference, but lower sample quality

Need something that achieves all three!

13 of 50

The algorithms are dominated by AR and diffusion…

But none are perfect!

Is there anything that would break the ceiling of the two?

The answer is Yes!

14 of 50

Outline

  1. Two axes of inference scaling – sequence and refinement.
  2. “Inference-first perspective” for algorithms that scale.
  3. Why DDIM (and by extension, diffusion) is “sub-optimal”.
  4. New algorithms and insights from inference-first perspective.

15 of 50

How can we scale at inference-time?

  • Increase the number of tokens
  • Don’t increase the number of tokens

16 of 50

Inference-Time Scaling in Sequence Length

Increases the number of tokens

    • LLM Chain-of-Thought (CoT)
    • CoT with reasoning data
    • RL (DeepSeek-R1)

17 of 50

Inference-time Scaling in Refinement Steps

Does not increase the number of tokens

    • Diffusion models / Flow Matching

[Figure: progressive diffusion refinement for the prompt “puppy in space”.]

18 of 50

Categorizing existing algorithms

  • Does not scale in either
    • VAE, GAN, Normalizing Flows
  • Scale in sequence length, but not in refinement steps
    • GPT, PixelCNN, VAR, MaskGIT
  • Scale in refinement steps, not in sequence length
    • Diffusion models, energy-based models, consistency models
  • Scale in both (cont’d)

19 of 50

A lot of algorithms that scale in both axes

  • Sequence length in outer loop, refinement steps in inner loop.
    • Basically, how most “Autoregressive + Diffusion” models are done
    • MAR
    • Diffusion Forcing
    • Discrete LLMs

20 of 50

A lot of algorithms that scale in both axes

  • Scale refinement in outer loop, sequence length in inner loop.
    • Autoregressive distribution smoothing

21 of 50

Scaling efficiency in inference algorithm

Of course, just being able to scale up is not enough!

We also have to scale efficiently!

Infinite monkeys “can” type Shakespeare

AlphaGo was enabled by searching more efficiently

22 of 50

Three positions

1. The right inference algorithm should scale in both axes.

2. Assuming that the model has enough capacity (under the universal approximation theorem), it should use as few steps as possible.

3. Analyze the inference algorithm before the training algorithm!

(Applies to both continuous and discrete cases, but today we will focus on the continuous case)

23 of 50

Application to Continuous Diffusion

  1. The right inference algorithm should scale in both axes. (✓)

    • Diffusion does scale in the refinement axis.

  2. Assuming that the model has enough capacity (under the universal approximation theorem), it should use as few steps as possible. (✗)

    • DDIM requires multiple steps even when the model has infinite capacity!

24 of 50

Application to Continuous Diffusion

What do we want from the “right” inference algorithm?

There exists a solution for the model such that both hold:

  1. The inference algorithm generates the right distribution in N steps (scales correctly)
  2. The inference algorithm generates the right distribution in 1 step (scales efficiently)

Unfortunately, DDIM is NOT the “right” inference algorithm!

25 of 50

DDIM and the Inference Capacity Issue

  • The one-step DDIM update from time t to s is x_s = (α_s − σ_s α_t / σ_t) · x̂_θ(x_t, t) + (σ_s / σ_t) · x_t.
  • The network x̂_θ sees only (x_t, t), never the target time s: the dependence on s enters only through fixed affine coefficients.
  • So even with infinite model capacity, one large DDIM step cannot represent the exact map from x_t to x_s.

26 of 50

The Fix

  • Condition the model on both timesteps, f_θ(x_t, s, t), so a single step maps x_t directly to time s.

27 of 50

Diffusion Models and Flow Matching

  • NOT optimal in utilizing network capacity.

  • Learn an ODE, requiring MANY steps for accurate simulation
  • Ideal case: optimal use of model capacity / efficient inference-time scaling

28 of 50

Application to Continuous Diffusion

DDIM is NOT the “right” inference algorithm because the model takes only a single timestep as input!

We can fix it by asking the model to take 2 timesteps (a sketch follows below)!

  • Something new in the literature, known as “flow maps”
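A minimal sketch of the contrast, with the one-step DDIM update written in the standard x̂_0 parameterization; `x0_model`, `flow_map_model`, `alpha`, and `sigma` are placeholder names, not code from the paper:

```python
def ddim_step(x_t, s, t, x0_model, alpha, sigma):
    # One DDIM step from time t to time s. The network sees only (x_t, t);
    # the target time s enters only through fixed affine coefficients, so a
    # single large step cannot be exact even with infinite model capacity.
    x0_hat = x0_model(x_t, t)
    return (alpha(s) - sigma(s) * alpha(t) / sigma(t)) * x0_hat \
        + (sigma(s) / sigma(t)) * x_t


def flow_map_step(x_t, s, t, flow_map_model):
    # A flow-map step: the network takes BOTH timesteps, so a single step
    # from t to s can in principle represent the exact solution map.
    return flow_map_model(x_t, s, t)
```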

29 of 50

Analyze inference before training

Once the inference algorithm is decided, the model can be trained with many different approaches!

30 of 50

Inductive Moment Matching

  • Not dependent on denoising score matching / flow matching
  • Not dependent on score-based stochastic differential equations
  • The solution does not have to be connected to the probability flow ODE!

31 of 50

Intuition: “consistency” in distributions

For timesteps s < r < t, the following two distributions should be close:

  • Sample x_t, then make a one-step prediction from x_t to x_s.
  • Sample x_r, then make a one-step prediction from x_r to x_s.

32 of 50

Intuition: “consistency” in distributions

We can simply use Maximum Mean Discrepancy (MMD):

  1. Like a GAN, MMD has a “discriminator”
  2. Unlike a GAN,
    1. MMD uses a special family of discriminators drawn from an RKHS (reproducing kernel Hilbert space).
    2. No need to “optimize” the discriminator, so training is stable! (A sketch of the loss follows.)
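A minimal sketch of an MMD loss with an RBF kernel; IMM’s actual kernel choice, weighting, and stop-gradient details are in the paper, so treat this only as the shape of the objective:

```python
import torch

def mmd2(x, y, bandwidth=1.0):
    # Biased estimate of squared MMD between two groups of particles.
    # x: one-step predictions to time s starting from x_t, shape (particles, dim)
    # y: one-step predictions to time s starting from x_r, shape (particles, dim)
    def rbf(a, b):
        # RBF kernel: the RKHS "discriminator" family, nothing to optimize.
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * bandwidth ** 2))

    return rbf(x, x).mean() + rbf(y, y).mean() - 2 * rbf(x, y).mean()
```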

33 of 50

Advantages of IMM

  1. Single stage training, single objective function
  2. Generalizes consistency models (when comparing distributions with 1 sample)
  3. Quite stable to train
  4. Reaches SOTA few-step generation

34 of 50

Stable Training

  • Consistency model is a 1-particle special case

  • Stable training as long as >4 particles

35 of 50

Image Generation

  • Better than DiT/SiT
  • Outperforms VAR-d20 (600M params)

  • ImageNet-256×256 16-step FID: 1.90
    • Outperforms VAR-d30 (2B params)
  • CIFAR-10 2-step FID: 1.98

36 of 50

Scaling Property

37 of 50

Advancing Efficiency / Quality Frontier

38 of 50

The trilemma of continuous generative models

[Figure: the same trilemma triangle: training stability, high-quality samples, efficient inference, with GANs/diffusion distillation, diffusion models, and VAEs/normalizing flows each on an edge.]

IMM (and possibly other flow map methods) sits in the middle, achieving all three.

39 of 50

Applications to Discrete Diffusion

Consider Masked Diffusion, a performant variant of discrete diffusion

Shi et al., Simplified and Generalized Masked Diffusion for Discrete Data

40 of 50

Applications to Discrete Diffusion

In masked diffusion, a token’s value changes only when the input at that position is the [MASK] token.

Suppose seqlen = N, and we want to sample in L << N steps:

  • Then some step must sample at least two tokens at once (at least ⌈N/L⌉ by pigeonhole; see the sketch below)!

Shi et al., Simplified and Generalized Masked Diffusion for Discrete Data
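A one-line pigeonhole check; N and L are illustrative values:

```python
import math

# With seqlen N sampled in L steps, some step must decode at least
# ceil(N / L) tokens jointly.
N, L = 1024, 64
print(math.ceil(N / L))  # 16 tokens sampled together in at least one step
```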

41 of 50

Applications to Discrete Diffusion

Does the BERT-style model have “enough capacity”?

Suppose we try to predict:

The list of poker hands that consist of two English words are: [MASK] [MASK]

  • The valid responses include: “high card”, “two pair”, etc…
  • However, a BERT-style model samples each [MASK] independently, so it is also possible to generate “high pair” or “two card” (see the sketch after this list)!
  • This is not an issue with AR models, because words are generated one at a time.
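A toy sketch of this failure mode; the two-hand vocabulary and per-position marginals are illustrative assumptions:

```python
import random

# Valid two-word poker hands, and the per-position marginals a BERT-style
# model would learn from them.
valid_hands = [("high", "card"), ("two", "pair")]
first_marginal = [a for a, _ in valid_hands]    # ["high", "two"]
second_marginal = [b for _, b in valid_hands]   # ["card", "pair"]

# Filling each [MASK] independently can recombine the marginals into
# invalid hands like ("high", "pair") or ("two", "card").
sample = (random.choice(first_marginal), random.choice(second_marginal))
print(sample)
```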

42 of 50

Applications to Discrete Diffusion

From the inference-first perspective:

Masked discrete diffusion might have capacity issues when sampling in L << N steps with a BERT-style model, regardless of how it is trained!

43 of 50

Takeaway

Analyze the inference algorithm before the training algorithm!

  • Continuous case: better alternatives to diffusion models
  • Discrete case: limitations of the BERT-style diffusion LLM

44 of 50

Inductive Moment Matching: https://github.com/lumalabs/imm

Inference-first position paper: https://arxiv.org/abs/2503.07154

45 of 50

Join us: https://lumalabs.ai/join

Happy hour @ Barstool: https://lu.ma/5s0o2hlh

46 of 50

Learning to Take Large Strides in Time


47 of 50

Generalized Interpolant


48 of 50

Model and Sampling

  • Model: f_θ(x_t, s, t), a one-step map from time t to time s
  • Want: the one-step sample follows the target marginal at time s
  • Naïve objective: directly match the distribution of one-step samples against the target marginal
  • 2-step sample: x_r = f_θ(x_t, r, t), then x_s = f_θ(x_r, s, r)

49 of 50

Inductive Moment Matching


50 of 50

Inductive Learning Algorithm

[Figure: the inductive learning algorithm, illustrated with 2 particles.]