Triton for MTIA

Roman Levenstein, Shintaro Iwasaki, Ilia Cherniavskii

Presentation outline

  • What is MTIA?
  • Motivation for Triton for MTIA
  • Triton for MTIA feasibility study
  • Current functionality and performance of the prototype
  • Next steps
  • Open questions about Triton support for heterogeneous HW
  • Conclusion

What is MTIA?

Meta Training and Inference Accelerator (MTIA)

  • Meta’s first in-house ML accelerator
  • Main goals:
    • Perf/TCO
    • Generality and programmability
    • High Dev Efficiency via first-class integration with PyTorch and great support for kernel authoring
  • Publicly announced MTIA in May 2023

More information:

MTIA high-level HW architecture

High-level architecture of the accelerator

PE’s internal organization

MTIA Programming

  • SPMD programming model for kernels
    • Supported via the MTIA-specific KNYFE DSL
    • Lower-level C++ APIs are also available
  • A typical MTIA kernel does the following:
    • Loads data from DRAM/LLS into CBs (circular buffers) in local memory (LS) using DMA transfers
    • Performs computation on the data in CBs using the fixed-function units
    • Writes data back to DRAM using DMA transfers
  • Execution on fixed-function units is asynchronous and data-flow driven, via CB-based data dependencies between ISA instructions
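The load → compute → store pattern above can be sketched as a toy pure-Python model. All names here are illustrative (there are no public MTIA kernel APIs); the point is only the structure: DMA loads fill a circular buffer in local memory, a compute unit consumes CB entries, and results go back to DRAM.

```python
from collections import deque

CB_DEPTH = 2  # double buffering: tile i+1 can be in flight while tile i computes

def dma_load(dram, tile_idx, tile_size):
    """Model a DMA transfer of one tile from DRAM into local memory."""
    start = tile_idx * tile_size
    return dram[start:start + tile_size]

def compute(tile):
    """Model a fixed-function unit; here it just doubles each element."""
    return [2 * x for x in tile]

def run_kernel(dram, tile_size):
    out = []
    cb = deque(maxlen=CB_DEPTH)  # circular buffer in local memory (LS)
    n_tiles = len(dram) // tile_size
    for i in range(n_tiles):
        cb.append(dma_load(dram, i, tile_size))  # asynchronous on real HW
        out.extend(compute(cb.popleft()))        # consume the oldest CB entry
        # a dma_store back to DRAM would go here
    return out

print(run_kernel(list(range(8)), 2))  # -> [0, 2, 4, 6, 8, 10, 12, 14]
```

On real hardware the load, compute, and store stages overlap via the CB-based data dependencies; this sequential model only shows the producer/consumer roles.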

Motivation for Triton for MTIA

Why Triton for MTIA?

  • Growing Triton adoption
    • Triton is being adopted across the industry (that’s why all of us are here today!)
    • Triton has recently gained traction at Meta
      • PyTorch 2.0 adopted Triton to automatically fuse and optimize PyTorch operators for GPUs (via TorchInductor)
      • Some teams author new kernels in Triton, since writing Triton is easier than writing CUDA
  • High developer efficiency is critical to MTIA’s success
  • North Star vision: improve developer efficiency by unifying ML engineers’ development flows and tools, using Triton for both GPUs and MTIA
  • But Triton is primarily designed for GPUs

Triton for MTIA feasibility study

Proving feasibility by developing a working prototype to answer main open questions

  • Can Triton be efficiently lowered to run on MTIA HW?
    • Generated kernels should make effective use of major HW units (DPE, SFU, DMA, LS, CB, RISC-V vector cores)
    • Performance should be comparable to current KNYFE-DSL-generated kernels
  • If not, what would be the critical blockers for Triton on MTIA?
    • Determine whether the blockers are conceptual issues that are impossible to resolve
    • Propose resolutions for these blockers (e.g., Triton extensions, HW architecture changes in future generations of MTIA, etc.)

Prototype Overview: Software Diagram

The diagram compares the prototype’s compilation pipeline with Triton for CUDA and the planned production flow:

  • Triton for CUDA: Triton Code → Triton-MLIR (MLIR dialect) → TritonGPU-MLIR (MLIR dialect) → LLVM-MLIR / LLVM-IR → PTXAS/CUBIN
  • Triton for MTIA (prod): Triton Code → Triton-MLIR (MLIR dialect) → TritonMTIA-MLIR (MLIR dialect) → MTIA-MLIR → LLVM-MLIR / LLVM-IR → RISC-V binary for MTIA
  • Triton for MTIA (prototype): Triton Code → generated C/C++ → MTIA Clang/LLVM → RISC-V binary for MTIA
Note: the Triton DSL → MTIA C++ lowering was done for quick prototyping purposes only; the proper Triton for MTIA implementation will be MLIR-based.

Prototype Functionality Coverage

  • The Triton for MTIA prototype is able to generate kernels that make use of all major HW fixed-function units

Prototype performance

  • Kernels generated by the Triton for MTIA prototype showed performance comparable to manual/KNYFE kernels for long-tail operators
  • Kernels for FCs (fully-connected layers) cannot match the performance of existing optimized kernels
    • More MTIA-specific optimizations are required to use the HW reduction network and DMA broadcasts in the PE grid
    • May need some Triton extensions

Execution Time. Lower is better.

Next steps

Productization of Triton for MTIA

  • The decision was made based on the promising results of the Triton for MTIA feasibility study and prototype
  • The productized version will be a proper Triton backend for MTIA
    • Use MLIR compilation flows
    • Introduce MTIA-specific dialects and lowerings in the lower part of the compilation pipeline
  • Improve the PT2/TorchInductor → Triton integration to better support Triton for MTIA

The diagram shows the planned compilation pipelines, with the later stages shared between Triton and KNYFE:

  • Triton for CUDA: Triton Code → Triton-MLIR (MLIR dialect) → TritonGPU-MLIR (MLIR dialect) → LLVM-MLIR / LLVM-IR → PTXAS/CUBIN
  • Triton for MTIA: Triton Code → Triton-MLIR (MLIR dialect) → TritonMTIA-MLIR (MLIR dialect) → MTIA-MLIR → LLVM-MLIR / LLVM-IR → RISC-V binary for MTIA
  • KNYFE (DSL for MTIA): KNYFE Code (BLOCK) → FILTER-IR → MTIA-MLIR → LLVM-MLIR / LLVM-IR → RISC-V binary for MTIA

The MTIA-MLIR stage and everything below it are shared between Triton and KNYFE.
Open questions about Triton support for heterogeneous HW

Need for improved support for memory subsystems

  • Support for DMA-friendly memory access
    • ML accelerators often cannot use random access to memory efficiently, or cannot use it at all
      • DMAs are needed for efficient access
    • DMAs can typically only read/write a strided region (with other restrictions)
    • Traditional Triton kernels still use a tensor of pointers for tl.load() and tl.store()
      • Inferring that load()/store() actually operate on a DMA-friendly memory region is non-trivial in the general case
    • Wider use of block pointers is preferred
      • Easy to map to DMAs on ML accelerators
      • Would help on GPUs too
  • Support for additional/custom levels of the memory hierarchy
    • E.g., MTIA has an LLS (Last Level Scratch) with configurable size, shared across PEs, which Triton currently cannot easily utilize
    • It would be good if Triton allowed defining and using custom levels of the memory hierarchy
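To illustrate why block pointers map to DMAs so directly: a Triton block pointer (as built by `tl.make_block_ptr(base, shape, strides, offsets, block_shape, order)`) already carries a shape/strides/offsets description of a strided region, which is exactly what a strided DMA descriptor needs. The sketch below (pure Python, not Triton internals; the descriptor layout is an assumption for illustration) performs that translation for a 2-D tile, whereas recovering the same information from an arbitrary tensor of pointers would require proving the pointers form such a region.

```python
def block_ptr_to_dma_descriptor(base, strides, offsets, block_shape):
    """Turn a 2-D block pointer into a hypothetical strided DMA descriptor:
    start address, number of rows, stride between rows, elements per row.
    All in units of elements for simplicity."""
    row_stride, col_stride = strides
    # Typical DMA restriction: the innermost dimension must be contiguous.
    assert col_stride == 1, "DMA engines typically need a unit inner stride"
    start = base + offsets[0] * row_stride + offsets[1] * col_stride
    return {"start": start, "rows": block_shape[0],
            "row_stride": row_stride, "row_elems": block_shape[1]}

# A 16x16 tile at offset (32, 64) of a row-major 128x128 tensor at base 0x1000:
desc = block_ptr_to_dma_descriptor(base=0x1000, strides=(128, 1),
                                   offsets=(32, 64), block_shape=(16, 16))
print(hex(desc["start"]))  # -> 0x2040 (0x1000 + 32*128 + 64)
```

The translation is a handful of multiplies and adds; no pointer analysis is needed, which is the whole appeal for accelerator backends.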

Need for improved support for asynchronous execution

  • Triton today is fundamentally an imperative DSL
  • There is no support for exploiting the asynchronous nature of some custom HW units or async APIs
    • Execution on MTIA fixed-function units is asynchronous and follows a data-flow pattern; DMAs are also asynchronous
    • Recent versions of CUDA introduced some asynchronous APIs too
  • This likely requires extensions to the Triton DSL and its semantics
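The data-flow pattern in question can be modeled with a toy `asyncio` sketch (purely illustrative, no MTIA or CUDA APIs): each compute step awaits only the DMA it depends on, so work completes in dependency order rather than program order, which is what an imperative DSL cannot currently express.

```python
import asyncio

async def dma_load(tile_idx, latency):
    """Model an asynchronous DMA transfer with a given latency."""
    await asyncio.sleep(latency)
    return [tile_idx] * 4  # the loaded tile

async def process(tile_idx, latency, completion_log):
    # The compute step is gated by a data dependency, not by program order.
    tile = await dma_load(tile_idx, latency)
    completion_log.append(tile_idx)
    return sum(tile)

async def main():
    log = []
    # Tile 0's DMA is slow, tile 1's is fast: tile 1's compute fires first.
    results = await asyncio.gather(process(0, 0.05, log),
                                   process(1, 0.01, log))
    return results, log

results, log = asyncio.run(main())
print(results, log)  # results in program order; log shows completion order [1, 0]
```

A data-flow-aware Triton extension would let the compiler schedule kernels this way on HW whose units already behave like these coroutines.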

Need for supporting cross-PE primitives and PE topology-aware codegen for custom HW targets

  • The lack of support for DMA broadcast and inter-PE reduction significantly hurts the performance of GEMM-like kernels on MTIA
    • Efficient GEMM implementations are HW topology-aware (row-wise/column-wise synchronizations and reductions)
    • These are hard to express within Triton’s current arbitrary, topology-unaware mapping between program instances and PEs
  • Since GEMMs are special (e.g., PT2 has a dedicated template for GEMM-like kernels), we likely just need good abstractions that the Triton-MTIA compiler can understand well
    • A reasonable extension to Triton, or special pattern matching in the compiler, with reasonable fusion capabilities

Conclusion

  • Triton provides a good unifying programming abstraction for GPUs, MTIA, and many other HW targets
  • Triton improves developer efficiency for kernel authoring
  • Triton for MTIA has been proven feasible and will be productized
  • Meta welcomes efforts to improve the Triton DSL and its implementation toward better support for heterogeneous HW, and would like to participate in such efforts
  • We are hiring! :-)