2 of 16

Contents

SGLang Roadmap

2025 H1 Strategic Focus Areas

SGLang Overview

Recent Feature Highlight

Future Roadmap (2026 Q1)

3 of 16

Breakthrough: Large-Scale Deployment

SGLang: Overview

Industry-First Performance

SGLang is the first open-source system to nearly match the performance reported in the official DeepSeek blog, using prefill-decode (PD) disaggregation and expert parallelism (EP).

Performance Metrics (May 2025)

"This performance breakthrough validates our architectural decisions and positions SGLang as the go-to solution for organizations requiring enterprise-scale AI inference capabilities."

52.3k input tokens/s/node: industry-leading input processing speed

22.3k output tokens/s/node: exceptional generation throughput

5x cost reduction vs. DeepSeek API pricing

10+ teams successfully reproduced results

4 of 16

Breakthrough: Large-Scale Deployment

Multiple Hardware Support

H20

GB300

AMD

JAX

Spark

Intel

SemiAnalysis

5 of 16

Developer Community Expansion

1000+

Contributors

Active developers contributing code, documentation, and community support

60+

Institutions

Universities, research labs, and companies actively using SGLang

20+

Enterprise Users

Companies adopting SGLang as their default DeepSeek inference engine in the first month of release

Community Growth & Industry Adoption

6 of 16

Recent Feature Highlight

EPD Disaggregation

Optimized data handling for large-scale multimodal model deployment.

Mini-SGLang

A lightweight yet high-performance inference framework that shares the same high-level system architecture as SGLang

SpecForge v0.2

Draft Model Training Framework

SGL Diffusion

Accelerates image and video generation with diffusion models for production-level serving

Zero-overhead Speculative Decoding

Tunes the scheduler for speculative decoding, yielding a 10%-20% speedup across the board.

2025 H1 Strategic Focus Areas

7 of 16

Zero-overhead Speculative Decoding

Zero-overhead CPU runtime for LLMs

SGLang has been pioneering a zero-overhead CPU runtime for LLM serving since last year.

Scheduler Tuning for Spec Decoding

~20% speedup across the board.

Stream Design

GPU Forward Stream: Handles all tensor forward passes without blocking the GPU

CPU Schedule Stream: Delays output processing by one step and schedules continuous batching

Compatible with WideEP and PD disaggregation, supporting high-concurrency workloads
Logprobs fully supported
CUDA graphs captured for all stages
FP8 quantization of MoE weights in the MTP layer
Deployed for MiMo, GLM5, and DeepSeek V3.2 models in real industry workloads
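
The one-step delay in the stream design can be illustrated with a toy, CPU-only simulation. This is a minimal sketch and not SGLang's scheduler: `fake_forward`, `run_overlapped`, and the single-worker thread pool standing in for a GPU forward stream are all hypothetical. It only shows the ordering: step t is launched first, and the output of step t-1 is processed while step t is in flight.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_forward(step: int) -> str:
    """Stand-in for a GPU forward pass (hypothetical)."""
    return f"token{step}"


def run_overlapped(num_steps: int) -> tuple[list[str], list[str]]:
    """Launch step t, then process step t-1's output while t is in flight."""
    events: list[str] = []   # order of launches and output processing
    outputs: list[str] = []  # processed outputs, in step order
    # A single worker plays the role of one forward stream.
    with ThreadPoolExecutor(max_workers=1) as gpu:
        pending = None  # future holding the previous step's result
        for step in range(num_steps):
            future = gpu.submit(fake_forward, step)  # non-blocking launch
            events.append(f"launch {step}")
            if pending is not None:
                # Output processing is delayed by exactly one step.
                outputs.append(pending.result())
                events.append(f"process {step - 1}")
            pending = future
        if pending is not None:
            # Drain the last step after the loop ends.
            outputs.append(pending.result())
            events.append(f"process {num_steps - 1}")
    return events, outputs
```

In this toy the launch of step t never waits on the result of step t-1, which is the property the delayed output processing is meant to provide; a real scheduler additionally has to build the next batch before the previous tokens are known.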

8 of 16

Accelerating Image and Video Generation

SGLang Diffusion

1.2–5.9× speedup for image/video generation
Supports autoregressive + diffusion multimodal models (Wan, Hunyuan, Qwen-Image, Flux)
Optimized kernels + parallelism techniques
Collaboration: FastVideo team, Ant Group

9 of 16

SpecBundle & SpecForge v0.2

SpecForge: Draft Model Training Framework

Native SGLang integration for speculative decoding optimization

Training: Scalable distributed + memory-efficient
SpecBundle: Production-grade EAGLE-3 checkpoints on large-scale datasets
Model Support: Llama 4, DeepSeek, Qwen3 MoE, GPT-OSS
Deployments: Ant, Meituan, Nex-AGI, EigenAI
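
For readers new to the technique a draft model serves, here is a minimal sketch of one greedy draft-and-verify step. It is generic speculative decoding, not EAGLE-3 and not SpecForge or SGLang code; `speculative_step` and the `NextToken` callables are hypothetical, and a real engine verifies all draft tokens in one batched target forward pass rather than in a Python loop.

```python
from typing import Callable, Sequence

# A "model" here is any function mapping a token context to the next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(
    prefix: list[int], draft: NextToken, target: NextToken, k: int
) -> list[int]:
    """Propose k draft tokens, keep those the target agrees with, add one target token."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposed: list[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The target model checks each proposal in order.
    accepted: list[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: emit the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Every proposal matched: the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted
```

Each step emits between 1 and k + 1 tokens, all identical to what greedy decoding with the target alone would produce, so a better-trained draft model raises the acceptance length without changing the output.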

10 of 16

Native Session Support with RadixCache

Streaming Session: Hold KV across turns, O(1) restore
Agent-Aware Session: Pin system prompt / tool KV, skip repeated prefill
Cross-Session Sharing: Multiple sessions share one pinned prefix copy
Smart Eviction: Evict pinned KV by activity / priority under memory pressure
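
The pinning, sharing, and eviction ideas above can be sketched with a toy cache. This is an illustration under stated assumptions, not RadixCache: it uses an exact-match dictionary instead of a radix tree, stores no real KV tensors, and `ToySessionCache`, `PrefixEntry`, `attach`, and `evict_one` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class PrefixEntry:
    tokens: tuple[int, ...]
    sessions: set[str] = field(default_factory=set)  # sessions sharing this copy
    priority: int = 0
    last_used: int = 0


class ToySessionCache:
    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity  # max number of pinned prefixes
        self.entries: dict[tuple[int, ...], PrefixEntry] = {}
        self.clock = 0            # logical time used to track activity
        self.prefill_count = 0    # how many times a prefix had to be computed

    def attach(self, session_id: str, prefix: list[int], priority: int = 0) -> bool:
        """Attach a session to a prefix; return True if prefill was skipped."""
        self.clock += 1
        key = tuple(prefix)
        entry = self.entries.get(key)
        hit = entry is not None
        if entry is None:
            if len(self.entries) >= self.capacity:
                self.evict_one()  # memory pressure: make room first
            entry = PrefixEntry(tokens=key)
            self.entries[key] = entry
            self.prefill_count += 1
        entry.sessions.add(session_id)  # all sessions share the one pinned copy
        entry.priority = max(entry.priority, priority)
        entry.last_used = self.clock
        return hit

    def evict_one(self) -> tuple[int, ...]:
        """Evict the lowest-priority entry, breaking ties by least recent use."""
        victim = min(self.entries.values(), key=lambda e: (e.priority, e.last_used))
        del self.entries[victim.tokens]
        return victim.tokens
```

A second session attaching to the same system prompt returns `True` and leaves `prefill_count` unchanged, which mirrors the "skip repeated prefill" and "one pinned prefix copy" bullets.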

11 of 16

Scheduling Refactor: Decoupled Forward Patterns

Forward modes: 10 variants -> 3 (Decode, UniqueExtend, VarlenExtend)
Attention mask is orthogonal to the forward mode, not a mode of its own
IDLE / PREBUILT -> scheduler attributes, not forward modes
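
One way to picture the decoupling is a small enum plus two independent records. This is a sketch only: the three mode names come from the slide, while `ForwardSpec`, `SchedulerState`, `describe`, and the mask labels are hypothetical and do not reflect SGLang's actual classes.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ForwardMode(Enum):
    """The three forward patterns named on the slide."""
    DECODE = auto()
    UNIQUE_EXTEND = auto()
    VARLEN_EXTEND = auto()


@dataclass(frozen=True)
class ForwardSpec:
    mode: ForwardMode
    # The mask is its own field, so adding a mask type adds no new mode.
    attention_mask: str = "causal"  # hypothetical labels, e.g. "causal", "tree"


@dataclass
class SchedulerState:
    # IDLE / PREBUILT live on the scheduler side, not in ForwardMode.
    is_idle: bool = False
    is_prebuilt: bool = False


def describe(spec: ForwardSpec, state: SchedulerState) -> str:
    """Summarize what would run for a given spec and scheduler state."""
    if state.is_idle:
        return "skip forward (idle)"
    return f"{spec.mode.name} with {spec.attention_mask} mask"
```

With the mask and scheduler flags factored out, the number of forward modes stays at three no matter how many mask types or scheduler states are added.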

12 of 16

Scheduling Pipeline Refactor

ScheduleBatch (CPU, mutable) → ForwardBatch (GPU, minimal) → Backend Metadata (stateless expand)
Remove ModelWorkerBatch (a pass-through layer that copies ~30 fields twice)
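
The three-stage pipeline can be sketched with plain dataclasses. The stage names `ScheduleBatch` and `ForwardBatch` come from the slide; every field, the `to_forward_batch` method, and `expand_metadata` are assumptions made for illustration, and plain tuples stand in for GPU tensors.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ForwardBatch:
    """Minimal, immutable view handed to the model (tuples stand in for GPU tensors)."""
    input_ids: tuple[int, ...]
    seq_lens: tuple[int, ...]


@dataclass
class ScheduleBatch:
    """CPU-side, mutable bookkeeping owned by the scheduler."""
    request_ids: list[str]
    input_ids: list[list[int]]
    finished: set[str] = field(default_factory=set)

    def to_forward_batch(self) -> ForwardBatch:
        # Convert directly, with no intermediate pass-through object.
        active = [
            ids
            for rid, ids in zip(self.request_ids, self.input_ids)
            if rid not in self.finished
        ]
        flat = tuple(tok for ids in active for tok in ids)
        return ForwardBatch(input_ids=flat, seq_lens=tuple(len(ids) for ids in active))


def expand_metadata(batch: ForwardBatch) -> dict[str, tuple[int, ...]]:
    """Stateless expansion: backend metadata is a pure function of the batch."""
    offsets = [0]
    for length in batch.seq_lens:
        offsets.append(offsets[-1] + length)  # cumulative sequence lengths
    return {"cu_seqlens": tuple(offsets)}
```

Because `expand_metadata` keeps no state, each attention backend could derive whatever it needs from the same `ForwardBatch` without the scheduler knowing about backend-specific fields.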

13 of 16

Mini-SGLang

A lightweight yet high-performance inference framework that shares the same high-level system architecture as SGLang

Two main objectives: providing learning resources and enabling fast prototyping for research.

14 of 16

Roadmap (2026 Q1)

Feature Compatibility and Reliability: a performant combination of all major features
Make all advanced features compatible with each other

[Diagram: improving compatibility across feature areas]
Feature areas: speculative decoding, all kinds of parallelism, disaggregation, all kinds of memory pool, overlap scheduler
Compatibility work: Spec V2 (in progress), PP/EP refactor, Mem V2 (in progress)

15 of 16

Roadmap (2026 Q1)

Performance & Architecture Improvements

Overlap scheduler for spec decoding (default)
Prefill CUDA graph (default)
Unified memory pool for hybrid models
Mixed chunked prefill refactor
Torch compile stack
SRT core/plugin refactor
DP attention backend refactor

Parallelism

Pipeline parallelism refactor
Expert parallelism refactor
Context parallelism support
WideEP enhancement on GB200/GB300

Multimodality

Multimodal Extensions
SGLang Diffusion
SGLang Omni
Cookbook

Check it out on the SGLang GitHub!

16 of 16

Question & Answer

24.7K stars

X: https://x.com/lmsysorg

https://www.sglang.io/

Follow and ⭐ star us!