Triton for MTIA

Roman Levenstein, Shintaro Iwasaki, Ilia Cherniavskii

Presentation outline

  • What is MTIA?
  • Motivation for Triton for MTIA
  • Triton for MTIA feasibility study
  • Current functionality and performance of the prototype
  • Next steps
  • Open questions about Triton support for heterogeneous HW
  • Conclusion

What is MTIA?

Meta Training and Inference Accelerator (MTIA)

  • Meta’s first in-house ML accelerator
  • Main goals:
    • Perf/TCO
    • Generality and programmability
    • High Dev Efficiency via first-class integration with PyTorch and great support for kernel authoring
  • Publicly announced MTIA in May 2023

More information:

MTIA high-level HW architecture

High-level architecture of the accelerator

PE’s internal organization

MTIA Programming

  • SPMD programming model for kernels
    • Supported via the MTIA-specific KNYFE DSL
    • Lower-level C++ APIs are also available
  • A typical MTIA kernel does the following:
    • Loads data from DRAM/LLS into CBs (circular buffers) in local memory (LS) using DMA transfers
    • Performs computation on the data in CBs using the fixed-function units
    • Writes data back to DRAM using DMA transfers
  • Execution on fixed-function units is asynchronous and data-flow driven, via CB-based data dependencies between ISA instructions
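The load → compute → store pattern above can be sketched as a toy pure-Python model. All names here are illustrative (there are no public MTIA kernel APIs); the point is only the structure: DMA loads fill a circular buffer in local memory, a compute unit consumes CB entries, and results go back to DRAM.

```python
from collections import deque

CB_DEPTH = 2  # double buffering: tile i+1 can be in flight while tile i computes

def dma_load(dram, tile_idx, tile_size):
    """Model a DMA transfer of one tile from DRAM into local memory."""
    start = tile_idx * tile_size
    return dram[start:start + tile_size]

def compute(tile):
    """Model a fixed-function unit; here it just doubles each element."""
    return [2 * x for x in tile]

def run_kernel(dram, tile_size):
    out = []
    cb = deque(maxlen=CB_DEPTH)  # circular buffer in local memory (LS)
    n_tiles = len(dram) // tile_size
    for i in range(n_tiles):
        cb.append(dma_load(dram, i, tile_size))  # asynchronous on real HW
        out.extend(compute(cb.popleft()))        # consume the oldest CB entry
        # a dma_store back to DRAM would go here
    return out

print(run_kernel(list(range(8)), 2))  # -> [0, 2, 4, 6, 8, 10, 12, 14]
```

On real hardware the load, compute, and store stages overlap via the CB-based data dependencies; this sequential model only shows the producer/consumer roles.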

Motivation for Triton for MTIA

Why Triton for MTIA?

  • Growing Triton adoption
    • Triton is being adopted across the industry (that’s why all of us are here today!)
    • Triton has recently gained traction at Meta
      • PyTorch 2.0 adopted Triton to automatically fuse and optimize PyTorch operators for GPUs (via TorchInductor)
      • Some teams author new kernels in Triton, since writing Triton is easier than writing CUDA
  • High developer efficiency is critical to MTIA’s success
  • North Star vision: improve developer efficiency by unifying ML engineers’ development flows and tools, using Triton for both GPUs and MTIA
  • But Triton is primarily designed for GPUs

Triton for MTIA feasibility study

Proving feasibility by developing a working prototype to answer main open questions

  • Can Triton be efficiently lowered to run on MTIA HW?
    • Generated kernels should make effective use of major HW units (DPE, SFU, DMA, LS, CB, RISC-V vector cores)
    • Performance should be comparable to current KNYFE-DSL-generated kernels
  • If not, what would be the critical blockers for Triton on MTIA?
    • Determine whether the blockers are conceptual issues that are impossible to resolve
    • Propose resolutions for these blockers (e.g., Triton extensions, HW architecture changes in future generations of MTIA, etc.)

Prototype Overview: Software Diagram

The diagram compares the prototype’s compilation pipeline with Triton for CUDA and the planned production flow:

  • Triton for CUDA: Triton Code → Triton-MLIR (MLIR dialect) → TritonGPU-MLIR (MLIR dialect) → LLVM-MLIR / LLVM-IR → PTXAS/CUBIN
  • Triton for MTIA (prod): Triton Code → Triton-MLIR (MLIR dialect) → TritonMTIA-MLIR (MLIR dialect) → MTIA-MLIR → LLVM-MLIR / LLVM-IR → RISC-V binary for MTIA
  • Triton for MTIA (prototype): Triton Code → generated C/C++ → MTIA Clang/LLVM → RISC-V binary for MTIA
Note: the Triton DSL → MTIA C++ lowering was done for quick prototyping purposes only; the proper Triton for MTIA implementation will be MLIR-based.

Prototype Functionality Coverage

  • The Triton for MTIA prototype is able to generate kernels that make use of all major HW fixed-function units

Prototype performance

  • Kernels generated by the Triton for MTIA prototype showed performance comparable to manual/KNYFE kernels for long-tail operators
  • Kernels for FCs (fully-connected layers) cannot match the performance of existing optimized kernels
    • More MTIA-specific optimizations are required to use the HW reduction network and DMA broadcasts in the PE grid
    • May need some Triton extensions

Execution Time. Lower is better.

Next steps

Productization of Triton for MTIA

  • The decision was made based on the promising results of the Triton for MTIA feasibility study and prototype
  • The productized version will be a proper Triton backend for MTIA
    • Use MLIR compilation flows
    • Introduce MTIA-specific dialects and lowerings in the lower part of the compilation pipeline
  • Improve the PT2/TorchInductor → Triton integration to better support Triton for MTIA

The diagram shows the planned compilation pipelines, with the later stages shared between Triton and KNYFE:

  • Triton for CUDA: Triton Code → Triton-MLIR (MLIR dialect) → TritonGPU-MLIR (MLIR dialect) → LLVM-MLIR / LLVM-IR → PTXAS/CUBIN
  • Triton for MTIA: Triton Code → Triton-MLIR (MLIR dialect) → TritonMTIA-MLIR (MLIR dialect) → MTIA-MLIR → LLVM-MLIR / LLVM-IR → RISC-V binary for MTIA
  • KNYFE (DSL for MTIA): KNYFE Code (BLOCK) → FILTER-IR → MTIA-MLIR → LLVM-MLIR / LLVM-IR → RISC-V binary for MTIA

The MTIA-MLIR stage and everything below it are shared between Triton and KNYFE.
Open questions about Triton support for heterogeneous HW

Need for improved support for memory subsystems

  • Support for DMA-friendly memory access
    • ML accelerators often cannot use random access to memory efficiently, or cannot use it at all
      • DMAs are needed for efficient access
    • DMAs can typically only read/write a strided region (with other restrictions)
    • Traditional Triton kernels still use a tensor of pointers for tl.load() and tl.store()
      • Inferring that load()/store() actually operate on a DMA-friendly memory region is non-trivial in the general case
    • Wider use of block pointers is preferred
      • Easy to map to DMAs on ML accelerators
      • Would help on GPUs too
  • Support for additional/custom levels of the memory hierarchy
    • E.g., MTIA has an LLS (Last Level Scratch) with configurable size, shared across PEs, which Triton currently cannot easily utilize
    • It would be good if Triton allowed defining and using custom levels of the memory hierarchy
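To illustrate why block pointers map to DMAs so directly: a Triton block pointer (as built by `tl.make_block_ptr(base, shape, strides, offsets, block_shape, order)`) already carries a shape/strides/offsets description of a strided region, which is exactly what a strided DMA descriptor needs. The sketch below (pure Python, not Triton internals; the descriptor layout is an assumption for illustration) performs that translation for a 2-D tile, whereas recovering the same information from an arbitrary tensor of pointers would require proving the pointers form such a region.

```python
def block_ptr_to_dma_descriptor(base, strides, offsets, block_shape):
    """Turn a 2-D block pointer into a hypothetical strided DMA descriptor:
    start address, number of rows, stride between rows, elements per row.
    All in units of elements for simplicity."""
    row_stride, col_stride = strides
    # Typical DMA restriction: the innermost dimension must be contiguous.
    assert col_stride == 1, "DMA engines typically need a unit inner stride"
    start = base + offsets[0] * row_stride + offsets[1] * col_stride
    return {"start": start, "rows": block_shape[0],
            "row_stride": row_stride, "row_elems": block_shape[1]}

# A 16x16 tile at offset (32, 64) of a row-major 128x128 tensor at base 0x1000:
desc = block_ptr_to_dma_descriptor(base=0x1000, strides=(128, 1),
                                   offsets=(32, 64), block_shape=(16, 16))
print(hex(desc["start"]))  # -> 0x2040 (0x1000 + 32*128 + 64)
```

The translation is a handful of multiplies and adds; no pointer analysis is needed, which is the whole appeal for accelerator backends.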

Need for improved support for asynchronous execution

  • Triton today is fundamentally an imperative DSL
  • There is no support for exploiting the asynchronous nature of some custom HW units or async APIs
    • Execution on MTIA fixed-function units is asynchronous and follows a data-flow pattern; DMAs are also asynchronous
    • Recent versions of CUDA introduced some asynchronous APIs too
  • This likely requires extensions to the Triton DSL and its semantics
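The data-flow pattern in question can be modeled with a toy `asyncio` sketch (purely illustrative, no MTIA or CUDA APIs): each compute step awaits only the DMA it depends on, so work completes in dependency order rather than program order, which is what an imperative DSL cannot currently express.

```python
import asyncio

async def dma_load(tile_idx, latency):
    """Model an asynchronous DMA transfer with a given latency."""
    await asyncio.sleep(latency)
    return [tile_idx] * 4  # the loaded tile

async def process(tile_idx, latency, completion_log):
    # The compute step is gated by a data dependency, not by program order.
    tile = await dma_load(tile_idx, latency)
    completion_log.append(tile_idx)
    return sum(tile)

async def main():
    log = []
    # Tile 0's DMA is slow, tile 1's is fast: tile 1's compute fires first.
    results = await asyncio.gather(process(0, 0.05, log),
                                   process(1, 0.01, log))
    return results, log

results, log = asyncio.run(main())
print(results, log)  # results in program order; log shows completion order [1, 0]
```

A data-flow-aware Triton extension would let the compiler schedule kernels this way on HW whose units already behave like these coroutines.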

Need for supporting cross-PE primitives and PE topology-aware codegen for custom HW targets

  • The lack of support for DMA broadcast and inter-PE reduction significantly hurts the performance of GEMM-like kernels on MTIA
    • Efficient GEMM implementations are HW topology-aware (row-wise/column-wise synchronizations and reductions)
    • These are hard to express within Triton’s current arbitrary, topology-unaware mapping between program instances and PEs
  • Since GEMMs are special (e.g., PT2 has a dedicated template for GEMM-like kernels), we likely just need good abstractions that the Triton-MTIA compiler can understand well
    • A reasonable extension to Triton, or special pattern matching in the compiler, with reasonable fusion capabilities

Conclusion

  • Triton provides a good unifying programming abstraction for GPUs, MTIA, and many other HW targets
  • Triton improves developer efficiency for kernel authoring
  • Triton for MTIA has been proven feasible and will be productized
  • Meta welcomes efforts to improve the Triton DSL and its implementation toward better support for heterogeneous HW, and would like to participate in such efforts
  • We are hiring! :-)