Accelerating Your Research Journey with Triton DSL
Kiung Jung
ASO Lab, Yonsei University
kiung@yonsei.ac.kr
Research Interests
These days, Triton is everywhere
Inefficient PyTorch: a Softmax Example
Triton is ~4x faster than the naive torch implementation, and slightly better than the built-in torch.softmax.
(Python 3.13, torch 2.5.1, triton 3.2.0, RTX 3090)
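For reference, a minimal sketch of the naive eager-mode softmax being benchmarked, in the spirit of the official Triton tutorial (the name naive_softmax and the annotations are ours); each step launches its own kernel and round-trips through global memory:

```python
import torch

def naive_softmax(x: torch.Tensor) -> torch.Tensor:
    """Row-wise softmax on an N x N input, one CUDA kernel per line."""
    x_max = x.max(dim=1)[0]             # load N^2, store N
    z = x - x_max[:, None]              # load N^2 + N, store N^2
    numerator = torch.exp(z)            # load N^2, store N^2
    denominator = numerator.sum(dim=1)  # load N^2, store N
    return numerator / denominator[:, None]  # load N^2 + N, store N^2
```

Subtracting the row max keeps the exponentials numerically stable; the per-line memory traffic is tallied in the Operation Fusion section below.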
torch.compile?
Better, but still far from the ideal; Triton is already there.
(Python 3.11, torch 2.5.1, triton 3.1.0, T4 GPU. Dynamo does not work on Python 3.13+ at the time of writing.)
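Trying it is a one-liner (a sketch, reusing the naive_softmax defined above):

```python
import torch

# Dynamo traces the Python, Inductor fuses what it can
compiled_softmax = torch.compile(naive_softmax)
y = compiled_softmax(torch.randn(4096, 4096, device="cuda"))
```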
CUDA Software Stack
The path from application to GPU (a quick version check from Python follows below):
Application → CUDA Runtime API → CUDA Driver API → GPU
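From PyTorch you sit at the top of this stack; a small sketch to see what is underneath you (standard torch APIs):

```python
import torch

print(torch.version.cuda)                   # CUDA runtime version PyTorch was built against
print(torch.cuda.get_device_name(0))        # the GPU at the bottom of the stack
print(torch.cuda.get_device_capability(0))  # compute capability, e.g. (8, 6) for RTX 3090
```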
GPU Inside
Datacenter GPU vs. gaming GPU: the differences are not covered in this talk…
but they share a lot of features!
[Figure: GPU block diagrams. An H100-style GPU packs many Streaming Multiprocessors (SMs) around a shared L2 Cache and Global Memory (GDDR or HBM). Each SM contains Tensor Cores, CUDA Cores, Load/Store Units, Registers, an L1 Data Cache / Shared Memory, and an L1 Instruction Cache. The host side (CPU + DRAM) sits across the bus from the GPU.]
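You can inspect these numbers for the GPU you are on; for example (PyTorch API, exact attribute set can vary by version):

```python
import torch

props = torch.cuda.get_device_properties(0)
print(props.name)                         # e.g. "NVIDIA GeForce RTX 3090"
print(props.multi_processor_count)        # number of SMs
print(props.total_memory / 2**30, "GiB")  # global memory (GDDR or HBM)
```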
Operation Fusion
Consider input shape N × N. Naive PyTorch softmax launches one kernel per step, each round-tripping through global memory:

1. row max:      Load(N²)     / Compute / Store(N)
2. subtract max: Load(N² + N) / Compute / Store(N²)
3. exp:          Load(N²)     / Compute / Store(N²)
4. row sum:      Load(N²)     / Compute / Store(N)
5. divide:       Load(N² + N) / Compute / Store(N²)

In total: Load = 5N² + 2N, Store = 3N² + 2N
Triton
Consider input shape N × N. One fused kernel does it all, built from the standard ingredients (see the sketch below):
- Input/output buffers and their strides
- Input shape
- Kernel config (tunable)
- Block info
- Index/mask generation (load)
- Main logic
- Index generation (store)
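Putting those pieces together, a minimal softmax kernel sketch following the official tutorial (it assumes each row fits in a single block; the names are illustrative):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr,            # input/output buffers
                   in_stride, out_stride,      # input/output buffer strides
                   n_cols,                     # input shape
                   BLOCK_SIZE: tl.constexpr):  # kernel config (tunable)
    row = tl.program_id(0)                     # block info: one program per row
    offs = tl.arange(0, BLOCK_SIZE)            # index/mask generation (load)
    mask = offs < n_cols
    x = tl.load(in_ptr + row * in_stride + offs, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)                  # main logic: stable softmax,
    num = tl.exp(x)                            # entirely in registers/SRAM
    y = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * out_stride + offs, y, mask=mask)  # index generation (store)

def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    y = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)  # whole row in one block
    softmax_kernel[(n_rows,)](y, x, x.stride(0), y.stride(0), n_cols,
                              BLOCK_SIZE=BLOCK_SIZE)
    return y
```

Each row is loaded once and stored once, hence the 2N² total traffic in the table below.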
Memory traffic | Naive PyTorch | Triton
Load + Store   | 8N² + 4N      | 2N²

In theory, the performance will differ by a factor of 4; computational overhead can be disregarded in memory-bound kernels.
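A quick sanity check of that factor (pure arithmetic, fp32 assumed):

```python
N = 4096
bytes_per = 4                            # fp32
naive = (8 * N**2 + 4 * N) * bytes_per   # ~537 MB of traffic
fused = 2 * N**2 * bytes_per             # ~134 MB of traffic
print(naive / fused)                     # ~4.0, matching the measured speedup
```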
Back to the figure:
Triton is ~4x faster than the naive torch implementation, and slightly better than the built-in torch.softmax.
(Python 3.13, torch 2.5.1, triton 3.2.0, RTX 3090)
Triton Language & Features
Everything is here: the tl (triton.language) namespace provides 90+ operations.
@triton.jit — compiles a Python function into a GPU kernel
@triton.autotune
- Configs: candidate kernel configurations to benchmark
- Key: the configs above are re-evaluated anytime the value of x_size changes
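A sketch of how those pieces fit together (add_kernel and the config values are illustrative):

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[  # candidate configs, benchmarked on first launch
        triton.Config({"BLOCK_SIZE": 128}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 256}, num_warps=8),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=8),
    ],
    key=["x_size"],  # re-evaluated anytime the value of x_size changes
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, x_size, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < x_size
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)
```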
tl.constexpr — a compile-time constant; the kernel is specialized for each distinct value
Benchmarking: triton.testing.Benchmark + @triton.testing.perf_report
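A sketch following the tutorial pattern (softmax is the wrapper defined earlier; the grid of sizes is illustrative):

```python
import torch
import triton
import triton.testing

@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["N"],                          # x-axis: number of columns
        x_vals=[128 * i for i in range(2, 33)],
        line_arg="provider",
        line_vals=["triton", "torch"],
        line_names=["Triton", "torch.softmax"],
        ylabel="GB/s",
        plot_name="softmax-performance",
        args={},
    )
)
def bench(N, provider):
    x = torch.randn(4096, N, device="cuda", dtype=torch.float32)
    fn = (lambda: softmax(x)) if provider == "triton" \
        else (lambda: torch.softmax(x, dim=-1))
    ms = triton.testing.do_bench(fn)
    # one load + one store of the whole tensor -> achieved GB/s
    return 2 * x.numel() * x.element_size() * 1e-9 / (ms * 1e-3)

bench.run(print_data=True, show_plots=True)
```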
Useful environment variables, for example:
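A few commonly used ones (names from the Triton repository; availability can vary by version, and they should be set before Triton is imported):

```python
import os

os.environ["TRITON_INTERPRET"] = "1"         # run kernels in the CPU interpreter (debug/print)
os.environ["TRITON_PRINT_AUTOTUNING"] = "1"  # log the config @triton.autotune selects
os.environ["MLIR_ENABLE_DUMP"] = "1"         # dump the MLIR IR at each compilation stage

import triton  # env vars must be in place before this import
```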
See the current support status (Feb 2025)
Triton Integration into PyTorch
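torch.compile's Inductor backend emits Triton kernels for GPU graphs; one way to peek at the generated source (a sketch, logging options can vary by torch version):

```python
import os
os.environ["TORCH_LOGS"] = "output_code"  # dump Inductor's generated Triton source

import torch

@torch.compile
def fused(x):
    return torch.softmax(x * 2 + 1, dim=-1)

fused(torch.randn(4096, 4096, device="cuda"))  # generated code is logged on first call
```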
Useful Resources
Q&A
kiung@yonsei.ac.kr