1 of 14

SGLang Roadmap

Qiaolin Yu

Core Dev at SGLang

2 of 14

Contents

2025 H1 Strategic Focus Areas

01 SGLang Overview

02 Recent Feature Highlight

03 Future Roadmap (2026 Q1)

3 of 14

Breakthrough: Large-Scale Deployment

SGLang: Overview

Industry-First Performance

SGLang is the first open-source system to nearly match the performance reported in DeepSeek's official blog, using prefill-decode (PD) disaggregation and expert parallelism (EP).

Performance Metrics (May 2025)

"This performance breakthrough validates our architectural decisions and positions SGLang as the go-to solution for organizations requiring enterprise-scale AI inference capabilities."

  • 52.3k input tokens/s/node: industry-leading input processing speed
  • 22.3k output tokens/s/node: exceptional generation throughput
  • 5× cost reduction vs. DeepSeek API pricing
  • 10+ teams successfully reproduced the results
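A rough sketch of how a per-node throughput figure translates into serving cost. Only the 22.3k output tokens/s/node value comes from the slide; the GPU count per node and the hourly GPU price are assumptions chosen for illustration, so the result should not be read as the deck's own cost figure.

```python
def cost_per_million_tokens(tokens_per_s_per_node, gpus_per_node, usd_per_gpu_hour):
    """Serving cost in USD per one million tokens for a single node."""
    node_cost_per_hour = gpus_per_node * usd_per_gpu_hour
    tokens_per_hour = tokens_per_s_per_node * 3600
    return node_cost_per_hour / tokens_per_hour * 1_000_000


OUTPUT_TOKENS_PER_S_PER_NODE = 22_300  # from the slide
GPUS_PER_NODE = 8                      # assumption, not stated on the slide
USD_PER_GPU_HOUR = 2.00                # hypothetical rental price

decode_cost = cost_per_million_tokens(
    OUTPUT_TOKENS_PER_S_PER_NODE, GPUS_PER_NODE, USD_PER_GPU_HOUR
)
print(f"~${decode_cost:.2f} per 1M output tokens")
```

Changing either assumed value rescales the result linearly, which makes it easy to plug in a different hardware price.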

4 of 14

Multiple Hardware Support

H20

NVL72

AMD

JAX

Spark

Intel

SemiAnalysis

5 of 14

Developer Community Expansion

1000+

Contributors

Active developers contributing code, documentation, and community support

60+

Institutions

Universities, research labs, and companies actively using SGLang

20+

Enterprise Users

Companies adopting SGLang as their default DeepSeek inference engine in the first month of release

Community Growth & Industry Adoption

6 of 14

Recent Feature Highlight

EPD Disaggregation

Optimized data handling for large-scale multimodal model deployment.

Mini-SGLang

A lightweight yet high-performance inference framework that shares the same high-level system architecture as SGLang

SpecForge v0.2

Draft Model Training Framework

SGL Diffusion

Accelerates image and video generation with diffusion models for production-level serving

Zero-overhead Speculative Decoding

Tuning the scheduler for speculative decoding yields a 10–20% speedup across the board.

7 of 14

Zero-overhead Speculative Decoding

Zero-overhead CPU runtime for LLM

SGLang pioneered the zero-overhead CPU runtime for LLM inference last year.

Scheduler Tuning for Spec Decoding

~20% speedup across the board.

Stream Design

GPU Forward Stream: handles all tensor forward passes without blocking the GPU

CPU Schedule Stream: delays output processing by one step and schedules continuous batching
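The two-stream idea can be sketched in plain Python. This is a toy illustration, not SGLang's scheduler: a single-worker thread pool stands in for the GPU forward stream, `gpu_forward` is a hypothetical placeholder for a model forward pass, and the main thread plays the CPU schedule stream. The point it shows is the one-step delay: the next forward is enqueued before the previous step's output is processed.

```python
from concurrent.futures import ThreadPoolExecutor


def gpu_forward(batch):
    # Hypothetical stand-in for a forward pass; it just doubles each value.
    return [x * 2 for x in batch]


def run_overlapped(batches):
    """Enqueue step i+1 before processing the output of step i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as gpu:  # one worker ~ one forward stream
        pending = None  # forward whose output has not been processed yet
        for batch in batches:
            nxt = gpu.submit(gpu_forward, batch)  # schedule the next forward first
            if pending is not None:
                results.append(pending.result())  # output processing, one step late
            pending = nxt
        if pending is not None:
            results.append(pending.result())  # drain the final step
    return results


print(run_overlapped([[1, 2], [3], [4, 5]]))
```

Because the worker always has the next step queued while the main thread handles the previous output, the "GPU" does not sit idle waiting for CPU-side bookkeeping.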

8 of 14

Accelerating Image and Video Generation

SGLang Diffusion

  • 1.2–5.9× speedup for image/video generation
  • Supports autoregressive + diffusion multimodal models (Wan, Hunyuan, Qwen-Image, Flux)
  • Optimized kernels + parallelism techniques
  • Collaboration: FastVideo team, AntGroup

9 of 14

SpecBundle & SpecForge v0.2

SpecForge: Draft Model Training Framework

Native SGLang integration for speculative decoding optimization

  • Training: scalable, distributed, and memory-efficient
  • SpecBundle: Production-grade EAGLE-3 checkpoints trained on large-scale datasets
  • Model Support: Llama 4, DeepSeek, Qwen3 MoE, GPT-OSS
  • Deployments: Ant, Meituan, Nex-AGI, EigenAI
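For context on what a trained draft model is used for, here is a minimal sketch of a generic greedy draft-and-verify step. It is not EAGLE-3 and not SpecForge's API: `draft_next` and `target_next` are hypothetical callables that return the next token for a context, and verification is written as a loop for clarity, whereas real systems typically verify all drafted tokens in one batched forward pass of the target model.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) The target model checks each proposed token against its own greedy choice.
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


# Toy models over digits: the target counts upward; the draft errs after a 3.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

print(speculative_step([1], draft, target, k=4))
```

The better the draft model matches the target, the more tokens are accepted per target step, which is why draft-model training quality matters for the end-to-end speedup.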

10 of 14

Mini-SGLang

  • A lightweight yet high-performance inference framework that shares the same high-level system architecture as SGLang

  • Two main objectives: providing learning resources and enabling fast prototyping for research.

11 of 14

EPD Disaggregation

Disaggregated Vision-Language Architecture

Novel VLM design that separates vision encoding from language processing

  • Performance: 6–8× lower TTFT vs. colocation (at 1 QPS)
  • Optimized for: Image-heavy workloads where visual encoding is the bottleneck
  • Deployments: Rednote Hilab, Alibaba Cloud, AntGroup
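A toy sketch of the stage separation described above, reading EPD as encode, prefill, decode (an interpretation consistent with the slide, not a statement of SGLang's internals). All class and function names here are hypothetical. In a real deployment each stage would run on its own workers and hand data off over the network; here they are plain functions called in sequence.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    text: str
    images: list
    image_embeddings: list = field(default_factory=list)
    kv_ready: bool = False
    output: list = field(default_factory=list)


def encode_stage(req):
    # Vision-encoder worker: only image-bearing requests need this stage.
    if req.images:
        req.image_embeddings = [f"emb({img})" for img in req.images]
    return req


def prefill_stage(req):
    # Prefill worker: consumes text plus embeddings; the KV cache is just a flag here.
    req.kv_ready = True
    return req


def decode_stage(req, max_new_tokens=3):
    # Decode worker: generates placeholder tokens once prefill has finished.
    assert req.kv_ready, "decode requires a completed prefill"
    req.output = [f"tok{i}" for i in range(max_new_tokens)]
    return req


def serve(req):
    for stage in (encode_stage, prefill_stage, decode_stage):
        req = stage(req)
    return req


done = serve(Request(text="describe", images=["a.png", "b.png"]))
print(done.image_embeddings, done.output)
```

Keeping the encoder as its own stage means image-heavy traffic can be scaled independently of the language-model workers.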

12 of 14

Roadmap (2026 Q1)

  • Feature Compatibility and Reliability: a performant combination of all major features
  • Make all advanced features compatible with each other

[Diagram] Features to combine: speculative decoding × all kinds of parallelism × PD disaggregation × all kinds of memory pool × overlap scheduler

[Diagram] Improving compatibility: Spec V2 (in progress), PP/EP refactor, Mem V2 (in progress)
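The goal of making all advanced features compatible with each other can be framed as a test matrix. The sketch below enumerates every pair of the five feature families named on this slide; using pairwise combinations as the unit of testing is an assumption for illustration, not a description of SGLang's actual test suite.

```python
from itertools import combinations

FEATURES = [
    "speculative decoding",
    "parallelism",
    "PD disaggregation",
    "memory pool",
    "overlap scheduler",
]


def pairwise_matrix(features):
    """Return every unordered pair of features to exercise together."""
    return list(combinations(features, 2))


pairs = pairwise_matrix(FEATURES)
for a, b in pairs:
    print(f"{a} x {b}")
```

Five feature families give ten pairs, and the count grows quadratically as features are added, which is part of why full compatibility is a roadmap item in its own right.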

13 of 14

Roadmap (2026 Q1)

Performance & Architecture Improvements

  • Overlap scheduler for spec decoding (default)
  • Prefill CUDA graph (default)
  • Unified memory pool for hybrid models
  • Mixed chunked prefill refactor
  • Torch compile stack
  • SRT core/plugin refactor
  • DP attention backend refactor

Parallelism

  • Pipeline parallelism refactor
  • Expert parallelism refactor
  • Context parallelism support
  • WideEP enhancement on GB200/GB300

Multimodality

  • Multimodal Extensions
  • SGLang Diffusion
  • SGLang Omni
  • Cookbook

Check it out on the SGLang GitHub!

14 of 14

Question & Answer

Starred

22.9K

https://www.sglang.io/

Follow and ⭐ star us!