2 of 14

Content

SGLang Roadmap

2025 H1 Strategic Focus Areas

SGLang Overview

Recent Feature Highlight

Future Roadmap (2026 Q1)

3 of 14

Breakthrough: Large-Scale Deployment

SGLang: Overview

Industry-First Performance

SGLang is the first open-source system to nearly match the performance reported in the official DeepSeek blog, using prefill-decode (PD) disaggregation and expert parallelism (EP).

Performance Metrics (May 2025)

"This performance breakthrough validates our architectural decisions and positions SGLang as the go-to solution for organizations requiring enterprise-scale AI inference capabilities."

52.3k input tokens/s/node Industry-leading input processing speed

22.3k output tokens/s/node Exceptional generation throughput

5x cost reduction vs. DeepSeek API pricing

10+ teams successfully reproduced results
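To make the link between per-node throughput and serving cost concrete, here is a minimal sketch of the arithmetic. The 22.3k output tokens/s/node figure comes from the slide; the hourly node price is a placeholder assumption, not a number from this deck, and the function name is hypothetical.

```python
def cost_per_million_tokens(node_cost_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost to produce one million tokens on a single node."""
    tokens_per_hour = tokens_per_second * 3600
    return node_cost_per_hour / tokens_per_hour * 1_000_000


# Throughput from the slide: 22.3k output tokens/s/node.
OUTPUT_TOKENS_PER_SECOND = 22_300

# ASSUMPTION: an 8-GPU node rented at $2 per GPU-hour. Replace with your own price.
HYPOTHETICAL_NODE_COST_PER_HOUR = 8 * 2.0

output_cost = cost_per_million_tokens(
    HYPOTHETICAL_NODE_COST_PER_HOUR, OUTPUT_TOKENS_PER_SECOND
)
print(f"${output_cost:.2f} per 1M output tokens")
```

Under the assumed price this works out to roughly $0.20 per million output tokens; the result scales linearly with whatever hourly price is plugged in.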

4 of 14

Breakthrough: Large-Scale Deployment

Multiple Hardware Support

H20

NVL72

AMD

JAX

Spark

Intel

SemiAnalysis

5 of 14

Developer Community Expansion

1000+

Contributors

Active developers contributing code, documentation, and community support

60+

Institutions

Universities, research labs, and companies actively using SGLang

20+

Enterprise Users

Companies adopting SGLang as their default DeepSeek inference engine in the first month of release

Community Growth & Industry Adoption

6 of 14

Recent Feature Highlight

EPD Disaggregation

Optimized data handling for large-scale multimodal model deployment.

Mini-SGLang

A lightweight yet high-performance inference framework that shares SGLang's high-level system architecture

SpecForge v0.2

Draft Model Training Framework

SGL Diffusion

Accelerates image and video generation with diffusion models for production-level serving

Zero-overhead Speculative Decoding

Scheduler tuned for speculative decoding, yielding a 10-20% speedup across the board.

7 of 14

Zero-overhead Speculative Decoding

https://lmsys.org/blog/2024-12-04-sglang-v0-4/

Zero-overhead CPU runtime for LLM

SGLang pioneered the zero-overhead CPU runtime for LLM serving last year.

Scheduler Tuning for Spec Decoding

~20% speedup across the board.

Stream Design

GPU Forward Stream: Handles all tensor forwarding without blocking the GPU

CPU Schedule Stream: Delays output processing by one step and schedules continuous batching in the meantime
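The two-stream design can be illustrated with a small, CPU-only sketch. This is not SGLang's scheduler: the worker thread is a stand-in for the GPU forward stream, the "forward pass" is a toy function, and all names here are hypothetical. It only shows the key idea that the CPU launches step i before handling the output of step i-1.

```python
import queue
import threading
from typing import List


def gpu_worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Stand-in for the GPU forward stream: runs batches as they arrive."""
    while True:
        batch = inbox.get()
        if batch is None:  # sentinel: no more work
            break
        outbox.put([token + 1 for token in batch])  # toy "forward pass"


def run_overlapped(batches: List[List[int]]) -> List[List[int]]:
    """Launch step i, then process the output of step i-1 while step i runs."""
    inbox: queue.Queue = queue.Queue()
    outbox: queue.Queue = queue.Queue()
    worker = threading.Thread(target=gpu_worker, args=(inbox, outbox))
    worker.start()

    processed = []
    in_flight = 0
    for batch in batches:
        inbox.put(batch)  # schedule the next step without waiting
        in_flight += 1
        if in_flight > 1:  # output handling lags one step behind
            processed.append(outbox.get())
            in_flight -= 1
    while in_flight:  # drain the final delayed step
        processed.append(outbox.get())
        in_flight -= 1

    inbox.put(None)
    worker.join()
    return processed


results = run_overlapped([[1, 2], [3], [4, 5, 6]])
```

In a real engine the next batch depends on tokens that have not been processed yet, so the scheduler has to work with placeholders for them; this sketch sidesteps that by using independent batches.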

8 of 14

Accelerating Image and Video Generation

SGLang Diffusion

1.2–5.9× speedup for image/video generation
Supports autoregressive + diffusion multimodal models (Wan, Hunyuan, Qwen-Image, Flux)
Optimized kernels + parallelism techniques
Collaboration: FastVideo team, AntGroup

9 of 14

SpecBundle & SpecForge v0.2

SpecForge: Draft Model Training Framework

Native SGLang integration for speculative decoding optimization

Training: Scalable distributed + memory-efficient
SpecBundle: Production-grade EAGLE-3 checkpoints on large-scale datasets
Model Support: Llama 4, DeepSeek, Qwen3 MoE, GPT-OSS
Deployments: Ant, Meituan, Nex-AGI, EigenAI
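For readers new to the role of a draft model, here is a minimal sketch of one greedy draft-and-verify step. It is plain chain speculative decoding with toy models, not EAGLE-3 and not SpecForge's or SGLang's API; every function name is hypothetical. A real system verifies all drafted tokens in one batched target forward pass, whereas this sketch calls the target once per token for clarity.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """Return the tokens committed by one draft-then-verify round (greedy)."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2) The target model checks each proposed token in order.
    accepted: List[int] = []
    for token in proposal:
        expected = target(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(token)

    # 3) Every draft token matched, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def toy_target(tokens: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    """Toy draft model: agrees with the target except right after token 3."""
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


partial_accept = speculative_step([1], toy_draft, toy_target, k=4)
full_accept = speculative_step([5], toy_draft, toy_target, k=2)
```

The output always equals what greedy decoding with the target alone would produce; a better-trained draft model only increases how many tokens are committed per round.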

10 of 14

Mini-SGLang

A lightweight yet high-performance inference framework that shares SGLang's high-level system architecture

Two main objectives: providing learning resources and enabling fast prototyping for research.

11 of 14

EPD Disaggregation

Disaggregated Vision-Language Architecture

Novel VLM design that separates vision encoding from language processing

Performance: 6–8× lower TTFT vs. colocation (at 1 QPS)
Optimized for: Image-heavy workloads where visual encoding is the bottleneck
Deployments: Rednote Hilab, Alibaba Cloud, AntGroup
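The separation described above can be sketched as three independently sized worker pools. This assumes EPD stands for encode, prefill, and decode stages; the stage functions are toy stand-ins and none of the names below are SGLang APIs.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    prompt: str
    images: List[str]


def encode(images: List[str]) -> List[str]:
    """Toy vision encoder: one embedding placeholder per image."""
    return [f"emb({name})" for name in images]


def prefill(prompt: str, embeddings: List[str]) -> Dict[str, List[str]]:
    """Toy prefill: builds a stand-in for the KV cache from image and text tokens."""
    return {"context": embeddings + prompt.split()}


def decode(kv_cache: Dict[str, List[str]], max_new_tokens: int = 3) -> List[str]:
    """Toy decode: emits placeholder tokens given the prefilled context."""
    return [f"tok{i}" for i in range(max_new_tokens)]


def serve(requests: List[Request]) -> List[List[str]]:
    # Separate pools let the encoder tier scale independently of the language tiers.
    with ThreadPoolExecutor(max_workers=4) as encoders, \
         ThreadPoolExecutor(max_workers=2) as prefillers, \
         ThreadPoolExecutor(max_workers=2) as decoders:
        # All images start encoding at once, so a heavy request does not stall the others.
        encoded = [encoders.submit(encode, r.images) for r in requests]
        prefilled = [
            prefillers.submit(prefill, r.prompt, e.result())
            for r, e in zip(requests, encoded)
        ]
        decoded = [decoders.submit(decode, p.result()) for p in prefilled]
        return [d.result() for d in decoded]


outputs = serve([Request("describe this", ["a.png", "b.png"]), Request("hello", [])])
```

In a colocated setup the same workers would run all three stages, so image-heavy requests compete with text generation for the same resources; splitting the pools is one way to picture why time to first token can drop when encoding is the bottleneck.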

12 of 14

Roadmap (2026 Q1)

Feature Compatibility and Reliability: a performant combination of all major features
Make all advanced features compatible with each other

Improving Compatibility (diagram)

Feature areas: speculative decoding, all kinds of parallelism, disaggregation, all kinds of memory pool, overlap scheduler
Work items: Spec V2 (in progress), PP/EP Refactor, Mem V2 (in progress)

13 of 14

Roadmap (2026 Q1)

Performance & Architecture Improvements

Overlap scheduler for spec decoding (default)
Prefill CUDA graph (default)
Unified memory pool for hybrid models
Mixed chunked prefill refactor
Torch compile stack
SRT core/plugin refactor
DP attention backend refactor

Parallelism

Pipeline parallelism refactor
Expert parallelism refactor
Context parallelism support
WideEP enhancement on GB200/GB300

Multimodality

Multimodal Extensions
SGLang Diffusion
SGLang Omni
Cookbook

Check it out on the SGLang GitHub!

14 of 14

Question & Answer

Starred

22.9K

X: https://x.com/lmsysorg

https://www.sglang.io/

Follow and ⭐ star us!