1 of 14

Hierarchical Compilation

aka Regional compilation

2 of 14

Motivation

  • Larger models - Larger cold start compile time
    • Can we do better than O(N)?
  • Repeated blocks - LLM is a sequence of Transformer blocks
  • Idea - instead of compiling the full model, compile the transformer blocks

3 of 14

Two groups - Full model and Hierarchical compilation

  • Pro Full model compilation
    • Hierarchical fundamentally limits performance speedup
    • Hierarchical has overhead (guards and cudagraphs)
    • Hierarchical is hard to implement
    • Hierarchical requires users to change how to compile their model
    • Full model compilation has high compile time, but that’s a bug. We should just fix cold start compile time.
  • Pro Hierarchical compilation
    • Hierarchical allows better than O(N) compilation time, which is not possible with full model compilation
    • Traditional compilers do not inline everything, and ML compilers should do the same
    • Hierarchical, with enough implementation effort, can in practice give the same speedup as full model

4 of 14

Hierarchical Compilation

  • Phase 1
    • Users manually move torch.compile to repeated blocks like Transformer layers
    • Goal - torch.compile does not recompile for different instances of the same block (today it does!)
    • Release timeline - PyTorch 2.5

  • Phase 2 (vaguely defined - unplanned)
    • Users apply torch.compile to the full model, and mark repeated blocks as no-inline
    • torch.compile to perform recursive compilation to allow
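
A minimal sketch of the Phase 1 usage pattern, assuming a toy model: the `Block` and `Model` classes, the sizes, and the choice of `backend="eager"` (used only so the example runs without a GPU or a C++ toolchain) are illustrative assumptions, not part of the original slides. The point is only where `torch.compile` is applied: on each repeated block rather than on the whole model.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Hypothetical repeated block standing in for a Transformer layer."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.linear(x))


class Model(nn.Module):
    """Hypothetical model: a plain sequence of identical blocks."""

    def __init__(self, dim, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([Block(dim) for _ in range(num_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


torch.manual_seed(0)
model = Model(dim=16, num_layers=4)
x = torch.randn(2, 16)
expected = model(x)  # eager reference, computed before any compilation

# Phase 1: compile each repeated block instead of the full model.
for i, layer in enumerate(model.layers):
    model.layers[i] = torch.compile(layer, backend="eager")

out = model(x)
```

Whether the second, third, and fourth block reuse the first block's compiled code depends on the recompilation behavior discussed on the following slides.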

5 of 14

Dynamo - What happens with inbuilt torch nn modules?

  • Dynamo does not inline inbuilt nn modules
  • The resulting Dynamo graph holds a pointer to the inbuilt nn module owned by the user-defined (UDF) module instance

6 of 14

Dynamo - What happens with inbuilt torch nn modules?

[Diagram: the UDF nn module class is instantiated into a UDF nn module object (which has a linear module). torch.compile on that object produces a compiled graph that holds a pointer to the linear layer.]

7 of 14

Repercussion - Recompilation

[Diagram: the UDF nn module class is instantiated twice, giving two UDF nn module objects (each has a linear module). torch.compile on the first object produces a compiled graph with a pointer to that object's linear layer. torch.compile on the second object cannot reuse it and produces a second compiled graph: Recompile!!!]
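
The recompilation on this slide can be imitated with a small pure-Python toy. This is not Dynamo code: `ToyCompiler`, `Block`, and `Linear` are hypothetical stand-ins, and the identity check only mimics the effect of a graph that keeps a pointer to one specific inbuilt module.

```python
class Linear:
    """Stand-in for an inbuilt nn module: holds a single weight."""

    def __init__(self, weight):
        self.weight = weight


class Block:
    """Stand-in for a user-defined (UDF) module that owns a Linear."""

    def __init__(self, weight):
        self.linear = Linear(weight)


class ToyCompiler:
    """Caches 'compiled graphs', each guarded by the identity of one Linear."""

    def __init__(self):
        self.cache = []  # list of (guarded_linear, compiled_fn)
        self.compile_count = 0

    def run(self, block, x):
        for guarded_linear, fn in self.cache:
            if guarded_linear is block.linear:  # identity (ID_MATCH-style) guard
                return fn(x)
        # Guard miss: "compile" a graph that points at this exact Linear.
        self.compile_count += 1
        linear = block.linear
        fn = lambda x, linear=linear: linear.weight * x
        self.cache.append((linear, fn))
        return fn(x)


compiler = ToyCompiler()
blocks = [Block(weight=w) for w in (2.0, 3.0, 4.0)]
outputs = [compiler.run(b, 10.0) for b in blocks]
print(compiler.compile_count)  # one compilation per instance: 3
```

Calling `compiler.run` again on an already-seen block hits the cache, but every new instance misses, which is the O(N) behavior the deck wants to avoid.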

8 of 14

Inline through inbuilt nn modules

  • No more pointer to the original nn module
  • Parameters are lifted as graph inputs - the same graph can be reused for another UDF instance
  • No more recompilations!
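
A pure-Python toy of the idea above, under stated assumptions: `InliningToyCompiler`, `Block`, `Linear`, and `tensor_metadata` are hypothetical names, weights are plain lists, and the metadata check (type and length) only loosely stands in for a tensor-property guard. It is a sketch of why lifting parameters to inputs lets one graph serve many instances, not a description of Dynamo internals.

```python
class Linear:
    """Stand-in for an inbuilt nn module: holds a weight vector."""

    def __init__(self, weight):
        self.weight = weight


class Block:
    """Stand-in for a user-defined (UDF) module that owns a Linear."""

    def __init__(self, weight):
        self.linear = Linear(weight)


def tensor_metadata(t):
    """Guard key based on properties of the value, not on its identity."""
    return (type(t).__name__, len(t))


class InliningToyCompiler:
    """Caches one 'graph' per parameter metadata; parameters are inputs."""

    def __init__(self):
        self.cache = {}
        self.compile_count = 0

    def run(self, block, x):
        weight = block.linear.weight  # inlined: read the parameter here
        key = tensor_metadata(weight)
        if key not in self.cache:
            self.compile_count += 1
            # The graph takes the weight as an argument; no module is captured.
            self.cache[key] = lambda weight, x: [w * xi for w, xi in zip(weight, x)]
        return self.cache[key](weight, x)


compiler = InliningToyCompiler()
blocks = [Block([1.0, 2.0]), Block([3.0, 4.0]), Block([5.0, 6.0])]
outputs = [compiler.run(b, [10.0, 10.0]) for b in blocks]
print(compiler.compile_count)  # all three instances share one graph: 1
```

A block whose weight has a different length would still trigger a new compilation in this toy, which mirrors the fact that guards on the lifted parameters remain.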

[Diagram: two panels, "No inlining" and "Inlining".]

9 of 14

Workstreams for Phase 1

  • Guard overhead
  • Improve Cudagraphs support
  • Fix latent issues and graph breaks with inlining inbuilt nn modules

10 of 14

Guard overhead

  • Two categories of extra guards
    • nn module attribute guards - Fixed
      • This problem exists even in the non-inlined case; we just skipped those guards there
      • The perf problem is fixed with C++ guards

    • Guards on lifted parameters
      • Unavoidable - We traded ID_MATCH for TENSOR_MATCH to avoid recompiles
      • TENSOR_MATCH is already implemented in C++, so hopefully the overhead will be small

11 of 14

Cudagraphs problem

  • Cudagraphs require inputs to be at static memory locations
  • Previously, parameters were treated as constants and marked static
  • With inlining, parameters are graph inputs, so their addresses can differ between calls
  • Solution
    • Have multiple cudagraphs per Dynamo graph
    • Add a dispatcher layer to find and dispatch to the right cudagraph at runtime
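
The proposed dispatcher layer can be sketched in plain Python. Everything here is an assumption made for illustration: `ToyCudagraphDispatcher` is a hypothetical class, `id()` stands in for a tensor's device address, and "recording" is just creating a closure over the parameter objects. No real CUDA graph API is used.

```python
class ToyCudagraphDispatcher:
    """Keeps one recorded 'cudagraph' per distinct set of parameter addresses."""

    def __init__(self, graph_fn):
        self.graph_fn = graph_fn
        self.recordings = {}  # tuple of parameter addresses -> recorded callable
        self.record_count = 0

    def __call__(self, params, x):
        # id() is a stand-in for the static memory address of each parameter.
        key = tuple(id(p) for p in params)
        replay = self.recordings.get(key)
        if replay is None:
            # First time these addresses are seen: record a new "cudagraph"
            # that has the parameter objects baked in.
            self.record_count += 1
            baked = tuple(params)
            replay = lambda x, baked=baked: self.graph_fn(baked, x)
            self.recordings[key] = replay
        return replay(x)


def block_graph(params, x):
    """The single Dynamo-level graph shared by every block instance."""
    (weight,) = params
    return [w * x for w in weight]


dispatch = ToyCudagraphDispatcher(block_graph)
weights = [[1.0, 2.0], [3.0, 4.0]]  # parameters of two block instances
for _ in range(3):
    for w in weights:
        out = dispatch([w], 10.0)
print(dispatch.record_count)  # one recording per block instance: 2
```

The design trade-off is visible even in the toy: compilation happens once, but recording and memory cost still grow with the number of distinct parameter sets.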

12 of 14

Fix Dynamo latent issues and graph breaks

  • Dynamo has aggressive specialization for nn modules
    • Specialize on __init__, __getattr__, __setattr__
  • With inlining, we treat nn modules as UserDefinedObjectVariable
  • Improve the support for UserDefinedObjectVariable
  • Currently, around 40 CI jobs are failing (still too far off to start counting individual test failures)
  • Inlining is hidden behind a flag
    • torch._dynamo.config.inline_inbuilt_nn_modules

13 of 14

Phase 1 is useful for non-hierarchical compilation use cases

  • Automatically supports tracing parametrizations (torch.nn.utils.parametrize)
  • Automatically supports tracing hooks on inbuilt modules

Even if we decide to not pursue hierarchical compilation, Phase 1 is still useful

14 of 14

Before Phase 2 - Focus on full model compile times

  • Call for help - Inductor and AOT Autograd engineers to dive deep into cold start time
  • Phase 2 is complex, so due diligence on full model compilation is needed