vLLM Office Hours #28
Special Topic: Intro to GuideLLM: Evaluate your LLM Deployments for Real-World Inference
Michael Goin
Principal Software Engineer, Red Hat
vLLM Committer
Jenny Yi
Product Manager, Red Hat
Mark Kurtz
Technical Staff Member, Red Hat
Former CTO Neural Magic
1
Welcome!
A few housekeeping items before we start…
2
Where are you dialing in from?
🇧🇷 🇮🇳 🇺🇸 🇨🇳 🇬🇧 🇨🇦 🇫🇷 🇩🇪 🇲🇽 🇵🇱 🇪🇦 🇭🇷 🇨🇭 🇬🇷 🇰🇷 🇮🇱 🇦🇴 🏴 🇦🇪 🇵🇰 🇷🇸 🇹🇷 🇷🇴 🇱🇻 🇳🇿 🇸🇪 🇱🇻 🇮🇪 🇮🇹 🇩🇰 🇵🇹 🇺🇦 🇦🇷 🇳🇱 🇫🇮 🇨🇱 🇨🇴
What’s new in the past two weeks?
Upcoming vLLM Office Hours Sessions
Register for all sessions here.
View previous recordings here.
vLLM Project Update
3
About vLLM
The fastest-growing de facto standard in open source model serving
4
Llama
Granite
Mistral
DeepSeek
Qwen
Gemma
CUDA
ROCm
Gaudi/XPU
TPU
Neuron
CPU
$ uv pip install vllm --torch-backend=auto
$ vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8
About vLLM
The fastest-growing de facto standard in open source model serving
A fast and easy-to-use open source inference server supporting:
Full coverage of inference features:
5
Red Hat’s vLLM contributions
Red Hat is a top commercial contributor and produces enterprise-ready distributions of vLLM
6
Inference Kernels
Compression Integrations
Production Features
torch.compile
System Overhead
New vLLM blogs
Compression techniques target and reduce multiple bottlenecks for inference, including compute bandwidth, memory bandwidth, and total memory footprint.
GuideLLM is an open source toolkit for benchmarking LLM deployment performance by simulating real-world traffic and measuring key metrics like throughput and latency.
GuideLLM: Evaluate LLM deployments for real-world inference
The hidden cost of large language models
Axolotl meets LLM Compressor: Fast, sparse, open
Axolotl and LLM Compressor offer open source, productionized research solutions that address the core pain points in modern LLM workflows.
7
llm-d Community Update - June 2025
We're excited to announce our new YouTube channel! We've been recording our SIG meetings and creating tutorial content to help you get started with llm-d.
Get Involved
llm-d Community Roadmap Survey
We are looking to learn more about:
New YouTube Channel
8
Intro to GuideLLM
Evaluate LLM Deployments for Real-World Inference
Jenny Yi
Product Manager @ Red Hat
Mark Kurtz
Member of Technical Staff @ Red Hat
Former CTO at Neural Magic (acquired)
Today’s special topic:
9
Agenda
10
Agenda Today
Background on the rise of LLM adoption and production use-cases
Goals, User Flows, Use Cases
Deployments, Metrics, Rate Types, Datasets
Console and other data formats
Components under the hood
Modular Components, APIs, Current Users, Feature Roadmap
Evolving LLM Adoption in the Enterprise
LLM Adoption Today
Source:
https://www.gartner.com/en/newsroom/press-releases/2025-03-31-gartner-forecasts-worldwide-genai-spending-to-reach-644-billion-in-2025
Insights from real-world LLM deployments
11
$644B
Projected GenAI spending in 2025
The Inference Problem
LLM Adoption Today
Common Pain Points
Delivering production-ready LLMs means navigating tradeoffs between model quality, responsiveness, and cost.
In practice, optimizing for any two hurts the third:
Choosing the right model, performance targets, and hardware setup is complex; without clear measurements, it’s hard to make informed decisions.
12
From Possibilities to Priorities
LLM Adoption Today
E-commerce Chatbot: fast, conversational response is key
TTFT ≤ 200ms for 99% of requests
ITL ≤ 50ms for 99% of requests
RAG System: accuracy and completeness matter more than speed
TTFT ≤ 300ms, ITL ≤ 100ms (if streamed)
Request latency ≤ 3000ms for 99% of requests
How fast is fast enough and who decides?
13
From Priorities to Precision
LLM Adoption Today
Why real-world data matters for accurate decision making
14
Evaluate and Optimize your LLM Deployments for Real-World Inference
15
GuideLLM is purpose-built to solve these challenges
Mission
16
From Guesswork to Guidance
Mission
GuideLLM helps you benchmark models under flexible configurations to understand performance limits and optimize for cost, latency, and quality.
How GuideLLM addresses the problem
17
User flow
Mission
18
Use Cases
Mission
Regression & A/B Testing: track performance deltas triggered by model updates and performance alerts.
Cost & Capacity Planning: predict infrastructure needs under projected user loads.
Pre-deployment: inform model selection, size, and family.
Hardware Evaluation: compare latency and throughput across GPU/CPU instances.
When should you use GuideLLM?
19
"With NVIDIA H200, should I use a Llama 3.1 8B or 70B Instruct to create a customer service chatbot?"
How many servers do I need to keep my service running under maximum load?"
“How much more traffic can Llama 3.1 8B Instruct FP8 handle over the baseline?”
“What is the max RPS that my hardware can handle before my performance begins to degrade?"
How does GuideLLM simulate Real-World LLM Workloads?
20
Deployment & Metrics
https://arxiv.org/abs/2407.12391
Workload Simulation
What does GuideLLM capture?
21
vLLM Example:
TGI Example:
Deployment & Metrics
https://github.com/neuralmagic/guidellm/blob/main/docs/metrics.md
What does GuideLLM capture?
Statistical Summaries
Token Metrics
Performance Metrics
Request Status metrics
22
Setting up the deployment
23
Simulated Workload Rate Types
Synchronous: runs a single stream of requests, one at a time
Throughput: runs all requests in parallel to measure the maximum throughput of the server
Sweep: runs synchronous (min), throughput (max), and a series of benchmarks equally spaced between the two
Concurrent: runs a fixed number of request streams in parallel
Constant: sends requests asynchronously at a constant rate set by the user
Poisson: sends requests at a rate following a Poisson distribution, with the mean set by the user
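For example, a minimal sketch of running one specific rate type instead of a sweep; the --target, --rate-type, --max-seconds, and --data flags match the commands shown later in this deck, while the --rate flag (requests per second for constant/Poisson modes) is an assumption to verify against `guidellm benchmark --help`:

$ guidellm benchmark \
    --target "http://localhost:8000" \
    --rate-type constant \
    --rate 10 \
    --max-seconds 60 \
    --data "prompt_tokens=512,output_tokens=256"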
Datasets
Synthetic Data: simulate the profile of traffic/requests
Chat: prompt 512, output 256
Code Generation: prompt 512, output 512
Summarization: prompt 1024, output 256
RAG: prompt 4096, output 512
Hugging Face Datasets (e.g., ShareGPT)
File-based Datasets
In-memory Datasets
24
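A sketch of how the --data flag could map to each dataset type; the synthetic form matches the command used later in this deck, while treating --data as also accepting a Hugging Face dataset ID or a local file path is an assumption about the current CLI:

# Synthetic data (chat-style token profile)
$ guidellm benchmark --target "http://localhost:8000" \
    --data "prompt_tokens=512,output_tokens=256"

# Hugging Face dataset (assumed form; substitute a real dataset ID)
$ guidellm benchmark --target "http://localhost:8000" \
    --data "<hf-org>/<sharegpt-style-dataset>"

# File-based dataset (assumed form; a local JSONL/CSV file of prompts)
$ guidellm benchmark --target "http://localhost:8000" \
    --data "./prompts.jsonl"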
Configuring Workload Example
Use Case: Chat
Model: Llama-3.1-8B-Instruct-quantized.w4a16
Problem:
"What's the maximum RPS my 1xL40 setup can handle while still meeting my SLOs?
Dataset: Synthetic
Dataset Profile: 512 prompt/256 output
Rate Type: Sweep
SLO: TTFT < 200ms for 99% of requests
Let’s build a chatbot
25
$ vllm serve RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16
$ guidellm benchmark \
--target "http://localhost:8000" \
--rate-type sweep \
--max-seconds 30 \
--data "prompt_tokens=512,output_tokens=256"
How do I analyze and interpret the results?
26
Using the chat example we set up earlier, let’s run a benchmark.
While the benchmark is running, you can preview the progress and generation time.
27
guidellm benchmark \
--target "http://localhost:8000" \
--rate-type sweep \
--max-seconds 30 \
--data "prompt_tokens=512,output_tokens=256"
Outputs
28
Overall Metadata
Metadata per Benchmark
Benchmark Statistics
Analyzing results using the Console
As our target SLO is TTFT < 200ms for 99% of requests, the table shows us 10.69 RPS.
This test gives us an empirical per-node RPS limit for staying within our SLO, which can inform horizontal scaling in production.
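For a rough, hypothetical capacity calculation: if one node sustains about 10.69 RPS within the SLO and you expect a peak load of 100 RPS, you would provision ⌈100 / 10.69⌉ = 10 nodes, plus headroom for failover and traffic spikes.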
29
Outputs
What other output formats does GuideLLM support?
30
UI
Console
.json
.yaml
.csv
Using .csv Output for Custom Visualizations
Useful for comparing across hardware configurations, compressions, models, etc.
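A minimal sketch of post-processing the .csv output with pandas; the file name and the rate/ttft_p99_ms column names are placeholders for illustration, so substitute the headers from your actual export:

import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names -- adjust to match your GuideLLM CSV export.
df = pd.read_csv("guidellm_benchmarks.csv")

# Plot p99 TTFT against each benchmark's request rate to find the highest
# rate that still meets a 200 ms TTFT SLO.
fig, ax = plt.subplots()
ax.plot(df["rate"], df["ttft_p99_ms"], marker="o")
ax.axhline(200, linestyle="--", label="TTFT SLO (200 ms)")
ax.set_xlabel("Requests per second")
ax.set_ylabel("p99 TTFT (ms)")
ax.legend()
fig.savefig("ttft_vs_rps.png")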
31
UI Demo
32
How can we trust these results?
33
Accurate Scheduling at Any Scale
Your benchmarks reflect system performance, not framework limitations
34
Every Metric Precisely Measured
You get complete visibility, not just a summary
35
Auditable, Transparent Benchmarking
Every run is reproducible and every result is auditable
36
How can I expand on GuideLLM?
37
Architecture Diagram
38
Built to Plug in Anywhere
Flexible integration for real-world pipelines
39
Modular by Design
Custom Data, Backends, and Load Strategies
40
What next?
How you might extend GuideLLM
Use Beyond the CLI
from guidellm.benchmark.run import run_benchmark
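A hypothetical sketch of a programmatic invocation; the import is taken from this slide, but the keyword arguments below simply mirror the CLI flags used earlier and are assumptions rather than the confirmed run_benchmark signature:

from guidellm.benchmark.run import run_benchmark

# Assumed argument names, mirroring --target, --rate-type, --max-seconds, --data.
results = run_benchmark(
    target="http://localhost:8000",
    rate_type="sweep",
    max_seconds=30,
    data="prompt_tokens=512,output_tokens=256",
)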
GuideLLM is open-source — try it, file a feature request or issue, or contribute your first PR!
41
Internal Adoption: Customer Zero
The Red Hat AI Model Validation program uses GuideLLM as its benchmarking tool to test models and to generate and visualize data on performance expectations for our customers.
Prior to Red Hat, blogs and research published by Neural Magic also used GuideLLM.
Who’s Using it?
42
Who’s Using it?
Shout out to Charles Frye from Modal Labs for publishing this amazing interactive article on benchmarking inference engines using GuideLLM.
43
Future Roadmap
What we have planned…
44
Let’s have a discussion!
45
Get involved with the vLLM Community
Contribute to key vLLM features: Comment on and review PRs that are interesting to you. Join the discussion on RFCs. Check out "good first issue" tags.
Give Us Feedback: We'll email you today's recording as soon as it's ready. Respond and tell us what we are doing right and what we can do better with vLLM office hours. Or comment on this slide!
Join vLLM Developer Slack: Ask questions and engage with us via Slack. Join here.
Join Red Hat's vLLM Mission: Red Hat wants to bring open-source LLMs and vLLM to every enterprise on the planet. We are looking for vLLM Engineers to help us accomplish our mission. Apply here.
46
Thank you, and see you in two weeks!
Michael Goin
Principal Software Engineer, Red Hat
vLLM Committer
Jenny Yi
Product Manager, Red Hat
Mark Kurtz
Technical Staff Member, Red Hat
47