DL FRAMEWORKS AND OPEN-SOURCE LIBRARIES: FLASHINFER


WHAT IS FLASHINFER?

  • Library and kernel generator for LLM inference on NVIDIA GPUs
  • Provides performant inference kernels for LLM-specific operations such as:
    • Attention
    • Sampling (see the sketch after this list)
    • Mixture-of-Experts (MoE)
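
A minimal sketch of the fused sampling kernels (API names follow recent FlashInfer releases, where top-p sampling takes probabilities directly; older versions also required a uniform-noise tensor):

```python
import torch
import flashinfer

batch_size, vocab_size = 4, 32000
logits = torch.randn(batch_size, vocab_size, device="cuda")
probs = torch.softmax(logits, dim=-1)

# Fused top-p (nucleus) sampling: renormalization and sampling happen in one
# kernel, avoiding the sort-based top-p common in pure PyTorch implementations.
samples = flashinfer.sampling.top_p_sampling_from_probs(probs, top_p=0.9)
print(samples)  # [batch_size] sampled token ids
```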


KEY FEATURES

    • Efficient attention kernels
      • Optimized operations supporting multiple attention patterns
      • Multiple backends: CUDA, CUTLASS, cuDNN, TRT-LLM

    • Memory efficiency
      • Efficient paged KV-cache storage for diverse workloads (see the sketch after this list)

    • CUDAGraph and torch.compile compatible
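
A hedged sketch of the paged KV-cache decode path the memory-efficiency bullet refers to (class and method names follow recent FlashInfer releases, e.g. BatchDecodeWithPagedKVCacheWrapper with plan/run; check your installed version):

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
batch_size, page_size, max_num_pages = 4, 16, 8

# Query vectors for one decode step, plus a paged KV pool in "NHD" layout:
# [max_num_pages, 2 (K/V), page_size, num_kv_heads, head_dim].
q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.half, device="cuda")
kv_cache = torch.randn(max_num_pages, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.half, device="cuda")

# CSR-style page table: here each request owns two full pages.
kv_page_indptr = torch.tensor([0, 2, 4, 6, 8], dtype=torch.int32, device="cuda")
kv_page_indices = torch.arange(8, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")
wrapper.plan(kv_page_indptr, kv_page_indices, kv_last_page_len,
             num_qo_heads, num_kv_heads, head_dim, page_size,
             data_type=torch.half)
out = wrapper.run(q, kv_cache)  # [batch_size, num_qo_heads, head_dim]
```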


FLASHINFER IN-THE-WILD

  • Notable adopters include vLLM and SGLang
  • FlashInfer is:
    • Designed for developer velocity and high-performance LLM deployments
    • Open-source and production-ready for enterprise LLM applications


[Figure: FlashInfer adoption across LLM serving frameworks]


MINIMAL EXAMPLE
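
A minimal sketch of single-request decode attention, following the single_decode_with_kv_cache API from the FlashInfer README (exact signatures may differ across versions):

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
kv_len = 2048

# One query vector per head for the current decode step, plus the KV cache.
q = torch.randn(num_qo_heads, head_dim, dtype=torch.half, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")

# Fused decode attention; grouped-query attention (32 query heads sharing
# 8 KV heads) is handled inside the kernel.
o = flashinfer.single_decode_with_kv_cache(q, k, v)
print(o.shape)  # [num_qo_heads, head_dim]
```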


VLLM INTEGRATION
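
vLLM can select FlashInfer as its attention backend through the VLLM_ATTENTION_BACKEND environment variable. A hedged sketch (the model name is only an example):

```python
import os
# Must be set before the engine is constructed.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["FlashInfer is"], params)
print(outputs[0].outputs[0].text)
```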


SGLANG INTEGRATION
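
SGLang uses FlashInfer as its default attention backend on NVIDIA GPUs; it can also be selected explicitly. A hedged sketch using the offline Engine API (the attention_backend keyword mirrors the server's --attention-backend flag and may vary across SGLang versions):

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",  # example model
    attention_backend="flashinfer",                 # explicit backend choice
)
outputs = llm.generate(["FlashInfer is"], {"temperature": 0.8, "max_new_tokens": 64})
print(outputs[0]["text"])
```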


JAX AND XLA

  • JAX sits on top of XLA
  • XLA targets the GPU through:
    • LLVM -> PTX
    • Triton -> PTX
    • cuDNN (convolutions, RNNs, batch norm)


[Diagram: JAX → XLA → Triton / cuDNN → PTX → GPU]
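
To make the stack concrete, a small sketch: jax.jit stages a function out to XLA, which compiles it for the GPU through the paths above. The lowering inspection uses JAX's AOT API (available in recent JAX versions):

```python
import jax
import jax.numpy as jnp

@jax.jit
def attn_scores(q, k):
    # Scaled dot-product scores; XLA fuses and compiles this for the GPU.
    return q @ k.T / q.shape[-1] ** 0.5

q = jnp.ones((4, 64), jnp.float32)
k = jnp.ones((8, 64), jnp.float32)
print(attn_scores(q, k).shape)  # (4, 8), computed by XLA-generated code

# Inspect the StableHLO handed to XLA's GPU backend:
print(jax.jit(attn_scores).lower(q, k).compiler_ir(dialect="stablehlo"))
```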


XLA AND FLASHINFER

  • XLA doesn’t support FlashInfer today
  • XLA supports custom calls to third-party kernel libraries
  • Integrating FlashInfer could give JAX/TF/XLA users access to highly optimized inference kernels without leaving the XLA ecosystem (a sketch of the custom-call idea follows)
  • FlashInfer brings rapid kernel evolution and early support for new GPU architectures
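
An illustration of the plumbing, not an existing integration: a production path would register an XLA custom-call target backed by FlashInfer's kernels (e.g. through jax.ffi). As a runnable stand-in, this sketch routes a host-side reference implementation through jax.pure_callback; all names here are hypothetical:

```python
import jax
import jax.numpy as jnp
import numpy as np

def flashinfer_decode_stub(q, k, v):
    # Hypothetical stand-in for a FlashInfer decode-attention kernel
    # (NumPy reference implementation of softmax(q k^T / sqrt(d)) v).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return (probs @ v).astype(q.dtype)

@jax.jit
def decode_attention(q, k, v):
    # In a real integration this would lower to an XLA custom call that
    # dispatches the FlashInfer kernel on-device.
    out_shape = jax.ShapeDtypeStruct(q.shape, q.dtype)
    return jax.pure_callback(flashinfer_decode_stub, out_shape, q, k, v)

q = jnp.ones((32, 128), jnp.float32)    # [num_heads, head_dim]
k = jnp.ones((1024, 128), jnp.float32)  # simplified shared-KV layout
v = jnp.ones((1024, 128), jnp.float32)
print(decode_attention(q, k, v).shape)  # (32, 128)
```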


THANK YOU
