2 of 14

Motivation

Larger models - Larger cold start compile time

Can we do better than O(N) compile time, where N is the number of layers?

Repeated blocks - LLM is a sequence of Transformer blocks
Instead of compiling the full model, compile the Transformer blocks

3 of 14

Two groups - Full model and Hierarchical compilation

Pro Full model compilation

Hierarchical fundamentally limits performance speedup
Hierarchical has overhead (guards and cudagraphs)
Hierarchical is hard to implement
Hierarchical requires users to change how to compile their model
Full model compilation has high compile time, but that is a bug. We should just fix cold start compile time.

Pro Hierarchical compilation

Hierarchical allows better than O(N) compilation time, which is not possible with full model compilation
Traditional compilers do not inline everything, we should do the same for ML compilers
Hierarchical, with enough implementation effort, can in practice give the same speedup as full model

4 of 14

Hierarchical Compilation

Phase 1

Users manually move torch.compile to repeated blocks like Transformer layers
torch.compile should not recompile for different instances of the same block (today it does!)
Release timeline - PyTorch 2.5
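
A minimal sketch of what Phase 1 asks of users, under stated assumptions: `Block` and `Model` are hypothetical stand-ins for a Transformer layer and an LLM, and `backend="eager"` is chosen only so the sketch runs without a GPU or C++ toolchain. Whether the second and later blocks reuse the first block's compiled code depends on the PyTorch version and on nn module inlining, so the sketch makes no claim about compile counts.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Hypothetical stand-in for one repeated Transformer layer."""

    def __init__(self, dim: int) -> None:
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x)) + x


class Model(nn.Module):
    """Hypothetical model: a plain sequence of identical blocks."""

    def __init__(self, dim: int = 16, num_layers: int = 4) -> None:
        super().__init__()
        self.layers = nn.ModuleList([Block(dim) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


model = Model()

# Move torch.compile from the full model to each repeated block.
for i in range(len(model.layers)):
    model.layers[i] = torch.compile(model.layers[i], backend="eager")

out = model(torch.randn(2, 16))
```

The outer `Model.forward` loop runs as ordinary Python here; only the blocks are compiled.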

Phase 2 (vaguely defined - unplanned)

Users apply torch.compile to the full model, and mark repeated blocks as no-inline
torch.compile to perform recursive compilation to allow

5 of 14

Dynamo - What happens with inbuilt torch nn modules?

Dynamo does not inline inbuilt nn modules
The resulting Dynamo graph holds a pointer to the inbuilt nn module owned by the UDF (user-defined) module instance

6 of 14

Dynamo - What happens with inbuilt torch nn modules?

[Diagram: the UDF nn module class is instantiated into a UDF nn module object (which has a linear module). torch.compile on that object produces a compiled graph that holds a pointer to its linear layer.]

7 of 14

Repercussion - Recompilation

[Diagram: the UDF nn module class is instantiated into two UDF nn module objects (each has a linear module). torch.compile on the first object produces a compiled graph that holds a pointer to its linear layer. torch.compile on the second object produces a second compiled graph - Recompile!!!]

8 of 14

Inline through inbuilt nn modules

No more pointer to the original nn module
Parameters are lifted as inputs - graph can be reused for another udf instance
No more recompilations!
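
The effect of lifting parameters can be illustrated with a toy, pure-Python model of a compile cache. This is a sketch, not Dynamo's real data structures: `ToyLinear` and `ToyCompiler` are hypothetical names, the "compiled graph" is just a closure, and the two cache keys only imitate an identity guard on the module versus a property guard on its lifted weight.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLinear:
    """Hypothetical stand-in for an inbuilt nn module with one weight."""

    weight: list  # rows of floats

    def shape(self) -> tuple:
        return (len(self.weight), len(self.weight[0]))


@dataclass
class ToyCompiler:
    """Toy compile cache; 'inline' switches the kind of guard used."""

    inline: bool
    cache: dict = field(default_factory=dict)
    compile_count: int = 0

    def _guard_key(self, module: ToyLinear) -> tuple:
        if self.inline:
            # Weight is a graph input: guard on its properties only.
            return ("tensor_match", module.shape())
        # Graph points at this exact module object: guard on identity.
        return ("id_match", id(module))

    def run(self, module: ToyLinear, x: list) -> list:
        key = self._guard_key(module)
        if key not in self.cache:
            self.compile_count += 1  # cache miss means a (re)compile
            self.cache[key] = lambda weight, vec: [
                sum(w * v for w, v in zip(row, vec)) for row in weight
            ]
        return self.cache[key](module.weight, x)


# Four instances with the same shape but different weights.
layers = [ToyLinear([[1.0, 0.0], [0.0, float(i)]]) for i in range(4)]

no_inline = ToyCompiler(inline=False)
inlined = ToyCompiler(inline=True)
for layer in layers:
    no_inline.run(layer, [1.0, 2.0])
    inlined.run(layer, [1.0, 2.0])

print(no_inline.compile_count, inlined.compile_count)  # prints "4 1"
```

With identity guards every instance misses the cache; with property guards on lifted weights, one entry serves all four instances.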

[Figure: side-by-side comparison, panels "No inlining" and "Inlining"]

9 of 14

Workstreams for Phase 1

Guard overhead
Improve Cudagraphs support
Fix latent issues and graph breaks with inlining inbuilt nn modules

10 of 14

Guard overhead

Two categories of extra guards

nn module attribute guards - Fixed

This problem is present even for non-inlined case, just that we skipped those guards
FIXED - The perf problem is fixed with C++ guards

Guards on lifted parameters

Unavoidable - We traded ID_MATCH for TENSOR_MATCH to avoid recompiles
TENSOR_MATCH already in C++, so hopefully the overhead will be small

11 of 14

Cudagraphs problem

Cudagraphs require inputs to be in static memory location
Earlier parameters were considered constants, and were marked static
With inlining, parameters are actually variables
Solution

Have multiple cudagraphs per Dynamo graph
Add a dispatcher layer to find and dispatch to the right cudagraph at runtime
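
The dispatcher idea can be sketched in pure Python. This is an illustration only, not the Inductor implementation: `CudagraphDispatcher` is a hypothetical name, a "recording" is just a closure rather than a real CUDA graph, and `id()` of each parameter buffer stands in for a tensor's memory address.

```python
class CudagraphDispatcher:
    """Toy dispatcher: one recording per distinct set of parameter addresses."""

    def __init__(self, fn):
        self.fn = fn
        self.recordings = {}

    def __call__(self, params, x):
        # Key on where the parameters live, since a recording is only
        # valid for the memory locations it was captured with.
        key = tuple(id(p) for p in params)
        if key not in self.recordings:
            self.recordings[key] = self._record(params)
        return self.recordings[key](x)

    def _record(self, params):
        # Stand-in for capturing a cudagraph with these parameters baked in.
        fn = self.fn
        return lambda x: fn(params, x)


def scale(params, x):
    """Hypothetical 'Dynamo graph': multiply the input by one scalar weight."""
    (w,) = params
    return [w[0] * v for v in x]


dispatch = CudagraphDispatcher(scale)
layer_a = [[2.0]]  # parameters of one module instance
layer_b = [[3.0]]  # parameters of another instance of the same class

results = [dispatch(p, [1.0, 2.0]) for p in (layer_a, layer_b, layer_a)]
```

One Dynamo graph (`scale`) ends up with two recordings here, and the third call reuses the first recording instead of capturing again.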

12 of 14

Fix Dynamo latent issues and graph breaks

Dynamo has aggressive specialization for nn modules

Specialize on __init__, __getattr__, __setattr__

With inlining, we treat nn modules as UserDefinedObjectVariable
Improve the support for UserDefinedObjectVariable
Currently, around 40 jobs are failing (still too far off to start counting individual test failures)
Hidden behind the flag

torch._dynamo.config.inline_inbuilt_nn_modules

13 of 14

Phase 1 is useful for non-hierarchical compilation use cases

Automatically supports tracing parametrizations (torch.nn.utils.parametrize)
Automatically supports tracing hooks on inbuilt modules

Even if we decide to not pursue hierarchical compilation, Phase 1 is still useful

14 of 14

Before Phase 2 - Focus on full model compile times

Call for help - Inductor and AOT Autograd engineers to dive deep into cold start time
Phase 2 is complex, so due diligence on full model compilation is needed