1 of 37

Accelerating Your Research Journey with Triton DSL

Kiung Jung

ASO Lab, Yonsei University

kiung@yonsei.ac.kr

2 of 37

Research Interests

  • Code Size Optimization
  • GPU Acceleration for Compilers
  • DL Compiler
    • NPU Compiler with MLIR
    • Performant Kernel Generation with Triton DSL

Previously,

  • Memory Compression
  • DL Compiler for SoC

3 of 37

Triton is everywhere

4 of 37

Inefficient PyTorch - Softmax Example

Triton is 4x faster than the naive torch implementation, and slightly better than the built-in torch.softmax.

Python 3.13, torch 2.5.1, triton 3.2.0, RTX 3090

5 of 37

torch.compile?

Better, but still far from the ideal.

Triton is here.

Python 3.11, torch 2.5.1, triton 3.1.0, T4 GPU (Dynamo does not work on Python 3.13+ at the time of writing)
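For reference, a minimal sketch of what "using torch.compile" means in code; the function below is an illustrative stand-in, not the exact kernel benchmarked above:

    import torch

    def f(x):
        # A small chain of eager ops that Inductor can fuse into one kernel.
        return torch.softmax(x * 2.0, dim=-1)

    f_compiled = torch.compile(f)   # TorchDynamo traces f, TorchInductor emits fused (Triton) kernels
    x = torch.randn(1024, 1024, device="cuda")
    y = f_compiled(x)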

6 of 37

CUDA Software Stack

7 of 37

The Way from App to GPU

  1. Application
  2. CUDA Libraries
    1. cuBLAS
  3. CUDA Runtime API (libcudart.so)
  4. CUDA Driver API (libcuda.so), sketched below
    • User mode APIs
    • #include <cuda.h>
    • cuMemcpyDtoA, cuMemAlloc
  5. NVIDIA GPU Driver (nvidia.ko)
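To make the "user mode API" layer concrete, here is a small, hedged sketch that calls libcuda directly from Python via ctypes (error handling omitted; the driver typically installs the library as libcuda.so.1):

    import ctypes

    # Load the user-mode CUDA Driver API (the libcuda.so layer from the list above).
    libcuda = ctypes.CDLL("libcuda.so.1")

    libcuda.cuInit(0)                                  # must precede any other driver API call

    version = ctypes.c_int()
    libcuda.cuDriverGetVersion(ctypes.byref(version))

    count = ctypes.c_int()
    libcuda.cuDeviceGetCount(ctypes.byref(count))

    print(f"driver version {version.value}, {count.value} CUDA device(s)")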

8 of 37

CUDA Runtime API

CUDA Driver API

9 of 37

GPU INSIDE

10 of 37

Datacenter GPU

vs Gaming GPU

is not covered in this talk…

but they share a lot of features!

11 of 37

H100

12 of 37

13 of 37

[Figure: GPU block diagram. Many Streaming Multiprocessors (SMs) share an L2 cache and global memory (GDDR or HBM); each SM contains an L1 instruction cache, an L1 data cache / shared memory, registers, CUDA cores, Tensor Cores, and load/store units.]

14 of 37

[Figure: the same GPU block diagram, now shown together with the host side (CPU and DRAM).]

15 of 37

Operation Fusion

16 of 37

Naive PyTorch

Load(N²)/Compute/Store(N)

Consider Input Shape: N x N

17 of 37

Naive PyTorch

Load(N²)/Compute/Store(N)

Load(N²+N)/Compute/Store(N²)

Consider Input Shape: N x N

18 of 37

Naive PyTorch

Load(N²)/Compute/Store(N)

Load(N²+N)/Compute/Store(N²)

Load(N²)/Compute/Store(N²)

Consider Input Shape: N x N

19 of 37

Naive PyTorch

Load(N²)/Compute/Store(N)

Load(N²+N)/Compute/Store(N²)

Load(N²)/Compute/Store(N²)

Load(N²)/Compute/Store(N)

Load(N²+N)/Compute/Store(N²)

In Total:

Load: 5N² + 2N
Store: 3N² + 2N

Consider Input Shape: N x N
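The counts above correspond to an eager decomposition along these lines (a sketch; the code on the original slide may differ in detail):

    import torch

    def naive_softmax(x):                                    # x: (N, N)
        x_max = x.max(dim=1, keepdim=True).values            # load N², store N
        z = x - x_max                                         # load N² + N, store N²
        numerator = torch.exp(z)                              # load N², store N²
        denominator = numerator.sum(dim=1, keepdim=True)      # load N², store N
        return numerator / denominator                        # load N² + N, store N²

Every intermediate is written back to global memory, which is where the 5N² + 2N loads and 3N² + 2N stores come from.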

20 of 37

Triton

Input Shape

Input/Output Buffer Strides

Input/Output Buffer

Kernel Config (Tunable)

Consider Input Shape: N x N

21 of 37

Triton

Block Info

  • index of the current block
  • total number of blocks in the grid

Consider Input Shape: N x N

22 of 37

Triton

Block Info

  • index of the current block
  • total number of blocks in the grid

Index/Mask Generation (Load)

Consider Input Shape: N x N

23 of 37

Triton

Block Info

  • index of the current block
  • total number of blocks in the grid

Index/Mask Generation (Load)

Main Logic

  • Load (N²)
  • Compute (5)

Consider Input Shape: N x N

24 of 37

Triton

Block Info

  • index of the current block
  • total number of blocks in the grid

Index/Mask Generation (Load)

Main Logic

  • Load (N²)
  • Compute (5)

Consider Input Shape: N x N

Index Generation (Store)

  • Store (N²)
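Putting the annotated pieces together, a minimal sketch of a fused softmax kernel in the shape these slides describe (one program per row; argument names are illustrative, not the exact code shown on the slide):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def softmax_kernel(out_ptr, in_ptr, in_row_stride, out_row_stride,
                       n_cols, BLOCK_SIZE: tl.constexpr):
        # Block info: one program (block) handles one row.
        row_id = tl.program_id(axis=0)
        # Index/mask generation for the load.
        col_offsets = tl.arange(0, BLOCK_SIZE)
        mask = col_offsets < n_cols
        # Main logic: one load of the row, a handful of compute ops...
        row = tl.load(in_ptr + row_id * in_row_stride + col_offsets,
                      mask=mask, other=-float("inf"))
        row = row - tl.max(row, axis=0)
        numerator = tl.exp(row)
        result = numerator / tl.sum(numerator, axis=0)
        # ...index generation for the store, then a single store.
        tl.store(out_ptr + row_id * out_row_stride + col_offsets, result, mask=mask)

    def softmax(x):                                   # x: (N, N) CUDA tensor
        n_rows, n_cols = x.shape
        y = torch.empty_like(x)
        BLOCK_SIZE = triton.next_power_of_2(n_cols)   # a whole row fits in one block
        softmax_kernel[(n_rows,)](y, x, x.stride(0), y.stride(0),
                                  n_cols, BLOCK_SIZE=BLOCK_SIZE)
        return y

Each input element is read once and each output element written once, which is the 2N² figure on the next slide.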

25 of 37

Naive PyTorch: 8N² + 4N

Triton: 2N²

In theory, the memory traffic differs by a factor of about 4: (8N² + 4N) / 2N² ≈ 4 for large N.

Computational overhead can be disregarded in memory-bound kernels like this one.

26 of 37

Back to the Figure,

Triton is 4x faster than the naive torch implementation, and slightly better than the built-in torch.softmax.

Python 3.13, torch 2.5.1, triton 3.2.0, RTX 3090

27 of 37

Triton Language & Features

28 of 37

Everything is Here,

29 of 37

It Has 90+ Operations

30 of 37

@triton.jit
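A minimal, hedged @triton.jit example (the classic vector add, not the code on the slide):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements                   # guard the last, partial block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    x = torch.rand(98432, device="cuda")
    y = torch.rand(98432, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)            # enough blocks to cover every element
    add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)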

31 of 37

@triton.autotune

Configs

  • possible kernel configs the auto-tuner will try

Key

  • the configs above are re-evaluated whenever the value of x_size changes (see the sketch below)

tl.constexpr

  • must be a compile-time constant
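A hedged sketch of the pattern these bullets describe (the kernel body is illustrative; x_size matches the key named above):

    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[                                     # possible configs the auto-tuner will try
            triton.Config({"BLOCK_SIZE": 256},  num_warps=4),
            triton.Config({"BLOCK_SIZE": 1024}, num_warps=8),
        ],
        key=["x_size"],                               # re-tune whenever x_size changes
    )
    @triton.jit
    def scale_kernel(x_ptr, out_ptr, x_size, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < x_size
        x = tl.load(x_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x * 2.0, mask=mask)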

32 of 37

@triton.testing.perf_report
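Triton's benchmarking helpers live in triton.testing: @triton.testing.perf_report with triton.testing.Benchmark configs produces plots like the ones earlier in this deck, and do_bench times a single callable. A minimal, hedged sketch (softmax() is assumed from the earlier kernel sketch):

    import torch
    import triton
    import triton.testing

    x = torch.randn(4096, 4096, device="cuda")

    # Kernel time in milliseconds; do_bench handles warmup and repetition.
    ms_eager  = triton.testing.do_bench(lambda: torch.softmax(x, dim=-1))
    ms_triton = triton.testing.do_bench(lambda: softmax(x))
    print(f"torch.softmax: {ms_eager:.3f} ms, triton: {ms_triton:.3f} ms")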

33 of 37

Useful EnvVars
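A few environment variables that tend to be useful while developing kernels (names from the Triton repository documentation; availability varies by version), shown here set from Python before triton is imported:

    import os

    os.environ["TRITON_INTERPRET"] = "1"                  # run kernels in the CPU interpreter for debugging
    os.environ["TRITON_PRINT_AUTOTUNING"] = "1"           # print the autotune config chosen for each kernel
    os.environ["TRITON_CACHE_DIR"] = "/tmp/triton_cache"  # where compiled kernels are cached
    os.environ["MLIR_ENABLE_DUMP"] = "1"                  # dump the IR around MLIR passes (very verbose)

    import triton  # set the variables above before this import to be safe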

34 of 37

See the Current Support (Feb 2025)

35 of 37

Triton Integration into PyTorch
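One concrete way to see the integration: TorchInductor emits Triton kernels for compiled graphs, and recent PyTorch releases can print the generated Triton source. A hedged sketch (assumes TORCH_LOGS is honored when set before torch is imported):

    import os
    os.environ["TORCH_LOGS"] = "output_code"        # ask Inductor to log the code it generates

    import torch

    def fused(x):
        return torch.relu(x * 2.0 + 1.0)

    compiled = torch.compile(fused)
    compiled(torch.randn(1 << 20, device="cuda"))   # the generated Triton kernel source is logged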

36 of 37

Useful Resources

37 of 37

Q&A

kiung@yonsei.ac.kr