1 of 37

Accelerating Your Research Journey with Triton DSL

Kiung Jung

ASO Lab, Yonsei University

kiung@yonsei.ac.kr

2 of 37

Research Interests

  • Code Size Optimization
  • GPU Acceleration for Compilers
  • DL Compiler
    • NPU Compiler with MLIR
    • Performant Kernel Generation with Triton DSL

Previously,

  • Memory Compression
  • DL Compiler for SoC

3 of 37

Triton is everywhere

4 of 37

Inefficient PyTorch - Softmax Example

Triton is 4x faster than the naive torch implementation, and slightly better than the built-in torch.softmax.

Python 3.13, torch 2.5.1, triton 3.2.0, RTX 3090

5 of 37

torch.compile?

Better, but still far from the ideal.

Triton is here.

Python 3.11, torch 2.5.1, triton 3.1.0, T4 GPU (Dynamo does not work on Python 3.13+ at the time of writing)
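For reference, a minimal sketch of what "using torch.compile" means in code; the function below is an illustrative stand-in, not the exact kernel benchmarked above:

    import torch

    def f(x):
        # A small chain of eager ops that Inductor can fuse into one kernel.
        return torch.softmax(x * 2.0, dim=-1)

    f_compiled = torch.compile(f)   # TorchDynamo traces f, TorchInductor emits fused (Triton) kernels
    x = torch.randn(1024, 1024, device="cuda")
    y = f_compiled(x)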

6 of 37

CUDA Software Stack

7 of 37

The Way from App to GPU

  1. Application
  2. CUDA Libraries
    1. cuBLAS
  3. CUDA Runtime API (libcudart.so)
  4. CUDA Driver API (libcuda.so), sketched below
    • User mode APIs
    • #include <cuda.h>
    • cuMemcpyDtoA, cuMemAlloc
  5. NVIDIA GPU Driver (nvidia.ko)
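To make the "user mode API" layer concrete, here is a small, hedged sketch that calls libcuda directly from Python via ctypes (error handling omitted; the driver typically installs the library as libcuda.so.1):

    import ctypes

    # Load the user-mode CUDA Driver API (the libcuda.so layer from the list above).
    libcuda = ctypes.CDLL("libcuda.so.1")

    libcuda.cuInit(0)                                  # must precede any other driver API call

    version = ctypes.c_int()
    libcuda.cuDriverGetVersion(ctypes.byref(version))

    count = ctypes.c_int()
    libcuda.cuDeviceGetCount(ctypes.byref(count))

    print(f"driver version {version.value}, {count.value} CUDA device(s)")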

8 of 37

CUDA Runtime API

CUDA Driver API

9 of 37

GPU INSIDE

10 of 37

Datacenter GPU

vs Gaming GPU

is not covered in this talk…

but they share a lot of features!

11 of 37

H100

12 of 37

13 of 37

[Figure: GPU block diagram. Many Streaming Multiprocessors (SMs) share an L2 cache and global memory (GDDR or HBM); each SM contains an L1 instruction cache, an L1 data cache / shared memory, registers, CUDA cores, Tensor Cores, and load/store units.]

14 of 37

[Figure: the same GPU block diagram, now shown together with the host side (CPU and DRAM).]

15 of 37

Operation Fusion

16 of 37

Naive PyTorch

Load(N²)/Compute/Store(N)

Consider Input Shape: N x N

17 of 37

Naive PyTorch

Load(N²)/Compute/Store(N)

Load(N²+N)/Compute/Store(N²)

Consider Input Shape: N x N

18 of 37

Naive PyTorch

Load(N²)/Compute/Store(N)

Load(N²+N)/Compute/Store(N²)

Load(N²)/Compute/Store(N²)

Consider Input Shape: N x N

19 of 37

Naive PyTorch

Load(N²)/Compute/Store(N)

Load(N²+N)/Compute/Store(N²)

Load(N²)/Compute/Store(N²)

Load(N²)/Compute/Store(N)

Load(N²+N)/Compute/Store(N²)

In Total:

Load: 5N² + 2N
Store: 3N² + 2N

Consider Input Shape: N x N
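The counts above correspond to an eager decomposition along these lines (a sketch; the code on the original slide may differ in detail):

    import torch

    def naive_softmax(x):                                    # x: (N, N)
        x_max = x.max(dim=1, keepdim=True).values            # load N², store N
        z = x - x_max                                         # load N² + N, store N²
        numerator = torch.exp(z)                              # load N², store N²
        denominator = numerator.sum(dim=1, keepdim=True)      # load N², store N
        return numerator / denominator                        # load N² + N, store N²

Every intermediate is written back to global memory, which is where the 5N² + 2N loads and 3N² + 2N stores come from.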

20 of 37

Triton

Input Shape

Input/Output Buffer Strides

Input/Output Buffer

Kernel Config (Tunable)

Consider Input Shape: N x N

21 of 37

Triton

Block Info

  • index of the current block
  • total number of blocks in the grid

Consider Input Shape: N x N

22 of 37

Triton

Block Info

  • index of the current block
  • total number of blocks in the grid

Index/Mask Generation (Load)

Consider Input Shape: N x N

23 of 37

Triton

Block Info

  • index of the current block
  • total number of blocks in the grid

Index/Mask Generation (Load)

Main Logic

  • Load (N²)
  • Compute (5)

Consider Input Shape: N x N

24 of 37

Triton

Block Info

  • index of the current block
  • total number of blocks in the grid

Index/Mask Generation (Load)

Main Logic

  • Load (N²)
  • Compute (5)

Consider Input Shape: N x N

Index Generation (Store)

  • Store (N²)
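Putting the annotated pieces together, a minimal sketch of a fused softmax kernel in the shape these slides describe (one program per row; argument names are illustrative, not the exact code shown on the slide):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def softmax_kernel(out_ptr, in_ptr, in_row_stride, out_row_stride,
                       n_cols, BLOCK_SIZE: tl.constexpr):
        # Block info: one program (block) handles one row.
        row_id = tl.program_id(axis=0)
        # Index/mask generation for the load.
        col_offsets = tl.arange(0, BLOCK_SIZE)
        mask = col_offsets < n_cols
        # Main logic: one load of the row, a handful of compute ops...
        row = tl.load(in_ptr + row_id * in_row_stride + col_offsets,
                      mask=mask, other=-float("inf"))
        row = row - tl.max(row, axis=0)
        numerator = tl.exp(row)
        result = numerator / tl.sum(numerator, axis=0)
        # ...index generation for the store, then a single store.
        tl.store(out_ptr + row_id * out_row_stride + col_offsets, result, mask=mask)

    def softmax(x):                                   # x: (N, N) CUDA tensor
        n_rows, n_cols = x.shape
        y = torch.empty_like(x)
        BLOCK_SIZE = triton.next_power_of_2(n_cols)   # a whole row fits in one block
        softmax_kernel[(n_rows,)](y, x, x.stride(0), y.stride(0),
                                  n_cols, BLOCK_SIZE=BLOCK_SIZE)
        return y

Each input element is read once and each output element written once, which is the 2N² figure on the next slide.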

25 of 37

Naive PyTorch: 8N² + 4N

Triton: 2N²

In theory, the memory traffic differs by a factor of about 4: (8N² + 4N) / 2N² ≈ 4 for large N.

Computational overhead can be disregarded in memory-bound kernels like this one.

26 of 37

Back to the Figure,

Triton is 4x faster than the naive torch implementation, and slightly better than the built-in torch.softmax.

Python 3.13, torch 2.5.1, triton 3.2.0, RTX 3090

27 of 37

Triton Language & Features

28 of 37

Everything is Here,

29 of 37

It Has 90+ Operations

30 of 37

@triton.jit
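A minimal, hedged @triton.jit example (the classic vector add, not the code on the slide):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements                   # guard the last, partial block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    x = torch.rand(98432, device="cuda")
    y = torch.rand(98432, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)            # enough blocks to cover every element
    add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)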

31 of 37

@triton.autotune

Configs

  • possible kernel configs the auto-tuner will try

Key

  • the configs above are re-evaluated whenever the value of x_size changes (see the sketch below)

tl.constexpr

  • must be a compile-time constant
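A hedged sketch of the pattern these bullets describe (the kernel body is illustrative; x_size matches the key named above):

    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[                                     # possible configs the auto-tuner will try
            triton.Config({"BLOCK_SIZE": 256},  num_warps=4),
            triton.Config({"BLOCK_SIZE": 1024}, num_warps=8),
        ],
        key=["x_size"],                               # re-tune whenever x_size changes
    )
    @triton.jit
    def scale_kernel(x_ptr, out_ptr, x_size, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < x_size
        x = tl.load(x_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x * 2.0, mask=mask)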

32 of 37

@triton.testing.perf_report
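Triton's benchmarking helpers live in triton.testing: @triton.testing.perf_report with triton.testing.Benchmark configs produces plots like the ones earlier in this deck, and do_bench times a single callable. A minimal, hedged sketch (softmax() is assumed from the earlier kernel sketch):

    import torch
    import triton
    import triton.testing

    x = torch.randn(4096, 4096, device="cuda")

    # Kernel time in milliseconds; do_bench handles warmup and repetition.
    ms_eager  = triton.testing.do_bench(lambda: torch.softmax(x, dim=-1))
    ms_triton = triton.testing.do_bench(lambda: softmax(x))
    print(f"torch.softmax: {ms_eager:.3f} ms, triton: {ms_triton:.3f} ms")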

33 of 37

Useful EnvVars
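A few environment variables that tend to be useful while developing kernels (names from the Triton repository documentation; availability varies by version), shown here set from Python before triton is imported:

    import os

    os.environ["TRITON_INTERPRET"] = "1"                  # run kernels in the CPU interpreter for debugging
    os.environ["TRITON_PRINT_AUTOTUNING"] = "1"           # print the autotune config chosen for each kernel
    os.environ["TRITON_CACHE_DIR"] = "/tmp/triton_cache"  # where compiled kernels are cached
    os.environ["MLIR_ENABLE_DUMP"] = "1"                  # dump the IR around MLIR passes (very verbose)

    import triton  # set the variables above before this import to be safe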

34 of 37

See the Current Support (Feb 2025)

35 of 37

Triton Integration into PyTorch
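One concrete way to see the integration: TorchInductor emits Triton kernels for compiled graphs, and recent PyTorch releases can print the generated Triton source. A hedged sketch (assumes TORCH_LOGS is honored when set before torch is imported):

    import os
    os.environ["TORCH_LOGS"] = "output_code"        # ask Inductor to log the code it generates

    import torch

    def fused(x):
        return torch.relu(x * 2.0 + 1.0)

    compiled = torch.compile(fused)
    compiled(torch.randn(1 << 20, device="cuda"))   # the generated Triton kernel source is logged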

36 of 37

Useful Resources

37 of 37

Q&A

kiung@yonsei.ac.kr