DL FRAMEWORKS AND OPEN-SOURCE LIBRARIES: FLASHINFER


WHAT IS FLASHINFER?

  • Library and kernel generator for LLM inference on NVIDIA GPUs
  • Provides performant inference kernels for LLM-specific operations such as:
    • Attention
    • Sampling (see the sketch after this list)
    • Mixture-of-Experts (MoE)
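
A minimal sketch of the fused sampling kernels (API names follow recent FlashInfer releases, where top-p sampling takes probabilities directly; older versions also required a uniform-noise tensor):

```python
import torch
import flashinfer

batch_size, vocab_size = 4, 32000
logits = torch.randn(batch_size, vocab_size, device="cuda")
probs = torch.softmax(logits, dim=-1)

# Fused top-p (nucleus) sampling: renormalization and sampling happen in one
# kernel, avoiding the sort-based top-p common in pure PyTorch implementations.
samples = flashinfer.sampling.top_p_sampling_from_probs(probs, top_p=0.9)
print(samples)  # [batch_size] sampled token ids
```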


KEY FEATURES

    • Efficient attention kernels
      • Optimized operations supporting multiple attention patterns
      • Multiple backends: CUDA, CUTLASS, cuDNN, TRT-LLM

    • Memory efficiency
      • Efficient paged KV-cache storage for diverse workloads (see the sketch after this list)

    • CUDAGraph and torch.compile compatible
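
A hedged sketch of the paged KV-cache decode path the memory-efficiency bullet refers to (class and method names follow recent FlashInfer releases, e.g. BatchDecodeWithPagedKVCacheWrapper with plan/run; check your installed version):

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
batch_size, page_size, max_num_pages = 4, 16, 8

# Query vectors for one decode step, plus a paged KV pool in "NHD" layout:
# [max_num_pages, 2 (K/V), page_size, num_kv_heads, head_dim].
q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.half, device="cuda")
kv_cache = torch.randn(max_num_pages, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.half, device="cuda")

# CSR-style page table: here each request owns two full pages.
kv_page_indptr = torch.tensor([0, 2, 4, 6, 8], dtype=torch.int32, device="cuda")
kv_page_indices = torch.arange(8, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")
wrapper.plan(kv_page_indptr, kv_page_indices, kv_last_page_len,
             num_qo_heads, num_kv_heads, head_dim, page_size,
             data_type=torch.half)
out = wrapper.run(q, kv_cache)  # [batch_size, num_qo_heads, head_dim]
```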


FLASHINFER IN-THE-WILD

  • Notable adopters include vLLM and SGLang
  • FlashInfer is:
    • Designed for developer velocity and high-performance LLM deployments
    • Open-source and production-ready for enterprise LLM applications


[Figure: FlashInfer adoption across LLM serving frameworks]


MINIMAL EXAMPLE
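
A minimal sketch of single-request decode attention, following the single_decode_with_kv_cache API from the FlashInfer README (exact signatures may differ across versions):

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
kv_len = 2048

# One query vector per head for the current decode step, plus the KV cache.
q = torch.randn(num_qo_heads, head_dim, dtype=torch.half, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")

# Fused decode attention; grouped-query attention (32 query heads sharing
# 8 KV heads) is handled inside the kernel.
o = flashinfer.single_decode_with_kv_cache(q, k, v)
print(o.shape)  # [num_qo_heads, head_dim]
```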


VLLM INTEGRATION
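
vLLM can select FlashInfer as its attention backend through the VLLM_ATTENTION_BACKEND environment variable. A hedged sketch (the model name is only an example):

```python
import os
# Must be set before the engine is constructed.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["FlashInfer is"], params)
print(outputs[0].outputs[0].text)
```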


SGLANG INTEGRATION
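
SGLang uses FlashInfer as its default attention backend on NVIDIA GPUs; it can also be selected explicitly. A hedged sketch using the offline Engine API (the attention_backend keyword mirrors the server's --attention-backend flag and may vary across SGLang versions):

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",  # example model
    attention_backend="flashinfer",                 # explicit backend choice
)
outputs = llm.generate(["FlashInfer is"], {"temperature": 0.8, "max_new_tokens": 64})
print(outputs[0]["text"])
```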


JAX AND XLA

  • JAX sits on top of XLA
  • XLA targets the GPU through:
    • LLVM -> PTX
    • Triton -> PTX
    • cuDNN (convolutions, RNNs, batch norm)


[Diagram: JAX → XLA → Triton / cuDNN → PTX → GPU]
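
To make the stack concrete, a small sketch: jax.jit stages a function out to XLA, which compiles it for the GPU through the paths above. The lowering inspection uses JAX's AOT API (available in recent JAX versions):

```python
import jax
import jax.numpy as jnp

@jax.jit
def attn_scores(q, k):
    # Scaled dot-product scores; XLA fuses and compiles this for the GPU.
    return q @ k.T / q.shape[-1] ** 0.5

q = jnp.ones((4, 64), jnp.float32)
k = jnp.ones((8, 64), jnp.float32)
print(attn_scores(q, k).shape)  # (4, 8), computed by XLA-generated code

# Inspect the StableHLO handed to XLA's GPU backend:
print(jax.jit(attn_scores).lower(q, k).compiler_ir(dialect="stablehlo"))
```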


XLA AND FLASHINFER

  • XLA doesn’t support FlashInfer today
  • XLA supports custom calls to third-party kernel libraries
  • Integrating FlashInfer could give JAX/TF/XLA users access to highly optimized inference kernels without leaving the XLA ecosystem (a sketch of the custom-call idea follows)
  • FlashInfer brings rapid kernel evolution and early support for new GPU architectures
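
An illustration of the plumbing, not an existing integration: a production path would register an XLA custom-call target backed by FlashInfer's kernels (e.g. through jax.ffi). As a runnable stand-in, this sketch routes a host-side reference implementation through jax.pure_callback; all names here are hypothetical:

```python
import jax
import jax.numpy as jnp
import numpy as np

def flashinfer_decode_stub(q, k, v):
    # Hypothetical stand-in for a FlashInfer decode-attention kernel
    # (NumPy reference implementation of softmax(q k^T / sqrt(d)) v).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return (probs @ v).astype(q.dtype)

@jax.jit
def decode_attention(q, k, v):
    # In a real integration this would lower to an XLA custom call that
    # dispatches the FlashInfer kernel on-device.
    out_shape = jax.ShapeDtypeStruct(q.shape, q.dtype)
    return jax.pure_callback(flashinfer_decode_stub, out_shape, q, k, v)

q = jnp.ones((32, 128), jnp.float32)    # [num_heads, head_dim]
k = jnp.ones((1024, 128), jnp.float32)  # simplified shared-KV layout
v = jnp.ones((1024, 128), jnp.float32)
print(decode_attention(q, k, v).shape)  # (32, 128)
```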


THANK YOU
