1 of 16

SGLang Roadmap

Liangsheng Yin

Core Dev at SGLang

2 of 16

Content

2025 H1 Strategic Focus Areas

01  SGLang Overview

02  Recent Feature Highlight

03  Future Roadmap (2026 Q1)

3 of 16

Breakthrough: Large-Scale Deployment

SGLang: Overview

Industry-First Performance

SGLang is the first open-source system to nearly match the performance reported in DeepSeek's official blog, using prefill-decode (PD) disaggregation and expert parallelism (EP).

Performance Metrics (May 2025)

"This performance breakthrough validates our architectural decisions and positions SGLang as the go-to solution for organizations requiring enterprise-scale AI inference capabilities."

  • 52.3k input tokens/s/node: industry-leading input processing speed
  • 22.3k output tokens/s/node: exceptional generation throughput
  • 5x cost reduction vs. DeepSeek API pricing
  • 10+ teams successfully reproduced the results

4 of 16

Multiple Hardware Support

H20

GB300

AMD

JAX

Spark

Intel

SemiAnalysis

5 of 16

Developer Community Expansion

1000+

Contributors

Active developers contributing code, documentation, and community support

60+

Institutions

Universities, research labs, and companies actively using SGLang

20+

Enterprise Users

Companies adopting SGLang as their default DeepSeek inference engine in the first month of release

Community Growth & Industry Adoption

6 of 16

Recent Feature Highlight

EPD Disaggregation

Optimized data handling for large-scale deployment of multimodal models.

Mini-SGLang

A lightweight yet high-performance inference framework that shares the same high-level system architecture as SGLang.

SpecForge v0.2

Draft Model Training Framework

SGLang Diffusion

Accelerates image and video generation with diffusion models for production-level serving.

Zero-overhead Speculative Decoding

Tuned the scheduler for speculative decoding, yielding a 10–20% speedup across the board.

7 of 16

Zero-overhead Speculative Decoding

Zero-overhead CPU runtime for LLM

SGLang pioneered the zero-overhead CPU runtime for LLM serving last year.

Scheduler Tuning for Spec Decoding

~20% speedup across the board.

Stream Design

GPU Forward Stream: Handles all tensor forward passes without blocking the GPU

CPU Schedule Stream: Delays output processing by one step and schedules continuous batching in the meantime

  • Compatible with WideEP and PD disaggregation, supporting high-concurrency workloads
  • Logprobs fully supported
  • CUDA graphs captured for all stages
  • FP8 quantization of MoE weights in the MTP layer
  • Deployed for MiMo, GLM-5, and DeepSeek-V3.2 models in real industry workloads
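
The two-stream design above can be illustrated with a toy loop. This is a minimal sketch and not SGLang's scheduler: `fake_gpu_forward` and `run_overlapped` are hypothetical names, and a single worker thread stands in for the GPU forward stream. It only shows the one-step delay: step N is launched first, and the CPU then processes the output of step N-1 while step N is in flight.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_gpu_forward(step):
    """Stand-in for a GPU forward pass; returns one 'token' per step."""
    return f"tok{step}"

def run_overlapped(num_steps):
    """Launch step N, then process step N-1's output while N is in flight."""
    log = []       # order of CPU-side events, recorded for inspection
    outputs = []   # tokens after CPU-side output processing
    with ThreadPoolExecutor(max_workers=1) as gpu:  # one worker = one "stream"
        pending = None                              # future of the previous step
        for step in range(num_steps):
            future = gpu.submit(fake_gpu_forward, step)  # non-blocking launch
            log.append(f"launch {step}")
            if pending is not None:
                # Delayed by one step: handle the previous result now.
                outputs.append(pending.result())
                log.append(f"process {step - 1}")
            pending = future
        if pending is not None:                     # drain the final step
            outputs.append(pending.result())
            log.append(f"process {num_steps - 1}")
    return outputs, log

outputs, log = run_overlapped(3)
print(outputs)
print(log)
```

In the log, "launch 1" appears before "process 0", which is the overlap the slide describes: the CPU never waits for a step's result before issuing the next one.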

8 of 16

Accelerating Image and Video Generation

SGLang Diffusion

  • 1.2–5.9× speedup for image/video generation
  • Supports autoregressive + diffusion multimodal models (Wan, Hunyuan, Qwen-Image, Flux)
  • Optimized kernels + parallelism techniques
  • Collaboration: FastVideo team, Ant Group

9 of 16

SpecBundle & SpecForge v0.2

SpecForge: Draft Model Training Framework

Native SGLang integration for speculative decoding optimization

  • Training: Scalable, distributed, and memory-efficient
  • SpecBundle: Production-grade EAGLE-3 checkpoints trained on large-scale datasets
  • Model Support: Llama 4, DeepSeek, Qwen3 MoE, GPT-OSS
  • Deployments: Ant, Meituan, Nex-AGI, EigenAI
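
For readers new to speculative decoding, the role of a draft model can be sketched as follows. This is plain greedy chain verification, a simplification chosen for illustration; it is not EAGLE-3's method and not SpecForge or SGLang code. `verify_draft` and `toy_target` are hypothetical names. A real system scores all draft positions in one batched forward pass of the target model rather than calling it once per position.

```python
def verify_draft(prefix, draft_tokens, target_next):
    """Accept draft tokens while they match the target model's greedy choice.

    target_next(seq) returns the target model's next token for seq.
    Returns the tokens actually emitted this step.
    """
    emitted = []
    seq = list(prefix)
    for tok in draft_tokens:
        expected = target_next(seq)
        if tok != expected:
            emitted.append(expected)   # first mismatch: take the target's token
            return emitted
        emitted.append(tok)            # draft token confirmed by the target
        seq.append(tok)
    emitted.append(target_next(seq))   # all accepted: one bonus token
    return emitted

def toy_target(seq):
    """Toy 'target model': the next token is always last token + 1 (mod 10)."""
    return (seq[-1] + 1) % 10

partial = verify_draft([1, 2, 3], [4, 5, 9], toy_target)
full = verify_draft([1, 2, 3], [4, 5, 6], toy_target)
print(partial, full)
```

The better the draft model matches the target, the more tokens are emitted per target forward pass, which is why draft-model training quality matters for end-to-end speedup.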

10 of 16

Native Session Support with RadixCache

  • Streaming Session: Hold KV across turns, O(1) restore
  • Agent-Aware Session: Pin system prompt / tool KV, skip repeated prefill
  • Cross-Session Sharing: Multiple sessions share one pinned prefix copy
  • Smart Eviction: Evict pinned KV by activity / priority under memory pressure
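
The session features above can be sketched with a toy prefix cache. This is an illustration under loud assumptions: `ToyPrefixCache` is a hypothetical flat dictionary keyed by token prefixes, not a radix tree and not SGLang's RadixCache, and "pinning" is modeled simply as a higher eviction priority. It assumes `capacity >= 1`.

```python
import itertools

class ToyPrefixCache:
    """Flat prefix cache standing in for a radix tree (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity         # max number of cached prefixes
        self.entries = {}                # tuple(tokens) -> [priority, last_used]
        self.clock = itertools.count()   # logical time used to track activity

    def insert(self, tokens, priority=0):
        key = tuple(tokens)
        # Under memory pressure, evict until there is room for a new key.
        while key not in self.entries and len(self.entries) >= self.capacity:
            self.evict_one()
        self.entries[key] = [priority, next(self.clock)]

    def match(self, tokens):
        """Return the longest cached prefix of tokens (possibly empty)."""
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self.entries:
                self.entries[key][1] = next(self.clock)  # mark as recently used
                return list(key)
        return []

    def evict_one(self):
        """Drop the entry with the lowest priority, then the least recent use."""
        victim = min(self.entries, key=lambda k: tuple(self.entries[k]))
        del self.entries[victim]
        return victim

cache = ToyPrefixCache(capacity=2)
cache.insert([1, 2, 3], priority=10)   # shared system prompt, "pinned"
cache.insert([1, 2, 3, 4, 5])          # session A, first turn
hit = cache.match([1, 2, 3, 9])        # session B reuses the system prompt
cache.insert([7, 8])                   # pressure: session A's entry is evicted
print(hit, sorted(cache.entries))
```

Session B skips prefill for the three shared tokens, and the pinned prefix survives eviction because of its higher priority.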

11 of 16

Scheduling Refactor: Decoupled Forward Patterns

  • Forward patterns reduced from 10 variants to 3: Decode, UniqueExtend, VarlenExtend
  • The attention mask is an orthogonal setting, not a forward mode
  • IDLE / PREBUILT become scheduler attributes, not forward modes
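
The decoupling described above might look like the following in code. This is only a sketch of the idea: the three pattern names come from the slide, but `ForwardSpec`, `AttentionMask`, `SchedulerState`, and every field name are hypothetical and not taken from the SGLang codebase.

```python
from dataclasses import dataclass
from enum import Enum, auto
from itertools import product

class ForwardPattern(Enum):
    """The three forward patterns named on the slide."""
    DECODE = auto()
    UNIQUE_EXTEND = auto()
    VARLEN_EXTEND = auto()

class AttentionMask(Enum):
    """Hypothetical mask kinds, kept separate from the forward pattern."""
    CAUSAL = auto()
    CUSTOM = auto()

@dataclass(frozen=True)
class ForwardSpec:
    pattern: ForwardPattern
    mask: AttentionMask = AttentionMask.CAUSAL   # orthogonal to the pattern

@dataclass
class SchedulerState:
    # IDLE / PREBUILT live on the scheduler side, not in ForwardPattern.
    is_idle: bool = False
    is_prebuilt: bool = False

# Every pattern combines with every mask without adding enum members.
combos = [ForwardSpec(p, m) for p, m in product(ForwardPattern, AttentionMask)]
print(len(ForwardPattern), len(combos))
```

Keeping the mask as its own field means a new mask kind adds one enum member instead of one new mode per existing pattern.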

12 of 16

Scheduling Pipeline Refactor

  • ScheduleBatch (CPU, mutable) → ForwardBatch (GPU, minimal) → Backend Metadata (stateless expansion)
  • Remove ModelWorkerBatch (a pass-through layer that copies ~30 fields twice)
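
The three-stage pipeline above can be sketched with plain dataclasses. The class names `ScheduleBatch` and `ForwardBatch` come from the slide, but every field and helper here (`to_forward_batch`, `expand_metadata`, `cu_seq_lens`) is a hypothetical simplification, with Python tuples standing in for GPU tensors.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ScheduleBatch:
    """CPU side: mutable state owned by the scheduler."""
    request_ids: List[str]
    input_ids: List[List[int]]

@dataclass(frozen=True)
class ForwardBatch:
    """Minimal, immutable view handed to the model forward pass."""
    flat_input_ids: Tuple[int, ...]
    seq_lens: Tuple[int, ...]

def to_forward_batch(batch: ScheduleBatch) -> ForwardBatch:
    """Single conversion step, with no intermediate pass-through object."""
    flat = tuple(tok for ids in batch.input_ids for tok in ids)
    return ForwardBatch(flat, tuple(len(ids) for ids in batch.input_ids))

def expand_metadata(fb: ForwardBatch) -> dict:
    """Stateless expansion: derived purely from the ForwardBatch."""
    starts = [0]
    for n in fb.seq_lens:
        starts.append(starts[-1] + n)   # cumulative sequence offsets
    return {"cu_seq_lens": starts, "max_seq_len": max(fb.seq_lens, default=0)}

batch = ScheduleBatch(["a", "b"], [[1, 2, 3], [4, 5]])
fb = to_forward_batch(batch)
meta = expand_metadata(fb)
print(fb, meta)
```

Because the backend metadata is a pure function of the `ForwardBatch`, it can be recomputed at any time and never needs to be kept in sync with scheduler state.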

13 of 16

Mini-SGLang

  • A lightweight yet high-performance inference framework that shares the same high-level system architecture as SGLang

  • Two main objectives: providing learning resources and enabling fast prototyping for research.

14 of 16

Roadmap (2026 Q1)

  • Feature Compatibility and Reliability: a performant combination of all major features
  • Make all advanced features compatible with each other

Improving Compatibility (diagram)

  • Features to combine: speculative decoding × all kinds of parallelism × PD disaggregation × all kinds of memory pool × overlap scheduler
  • Work items: Spec V2 (in progress), PP/EP refactor, Mem V2 (in progress)

15 of 16

Roadmap (2026 Q1)

Performance & Architecture Improvements

  • Overlap scheduler for spec decoding (default)
  • Prefill CUDA graph (default)
  • Unified memory pool for hybrid models
  • Mixed chunked prefill refactor
  • Torch compile stack
  • SRT core/plugin refactor
  • DP attention backend refactor

Parallelism

  • Pipeline parallelism refactor
  • Expert parallelism refactor
  • Context parallelism support
  • WideEP enhancement on GB200/GB300

Multimodality

  • Multimodal Extensions
  • SGLang Diffusion
  • SGLang Omni
  • Cookbook

Check it out on the SGLang GitHub!

16 of 16

Question & Answer

Starred

24.7K

https://www.sglang.io/

Follow and ⭐ star us!