1 of 47

vLLM Office Hours #28

Special Topic: Intro to GuideLLM: Evaluate your LLM Deployments for Real-World Inference

Michael Goin

Principal Software Engineer, Red Hat

vLLM Committer

Jenny Yi

Product Manager, Red Hat

Mark Kurtz

Technical Staff Member, Red Hat

Former CTO Neural Magic


2 of 47

Welcome!

A few housekeeping items before we start…

  • Let’s keep it interactive - speak up and engage in chat
  • This session is being recorded and live streamed
    • Find the recording on YouTube
    • Ask follow-up questions in the vLLM Developer Slack
    • Help us spread the word by sharing the live posts with your networks


Where are you dialing in from?

🇧🇷 🇮🇳 🇺🇸 🇨🇳 🇬🇧 🇨🇦 🇫🇷 🇩🇪 🇲🇽 🇵🇱 🇪🇦 🇭🇷 🇨🇭 🇬🇷 🇰🇷 🇮🇱 🇦🇴 🏴󠁧󠁢󠁳󠁣󠁴󠁿 🇦🇪 🇵🇰 🇷🇸 🇹🇷 🇷🇴 🇱🇻 🇳🇿 🇸🇪 🇱🇻 🇮🇪 🇮🇹 🇩🇰 🇵🇹 🇺🇦 🇦🇷 🇳🇱 🇫🇮 🇨🇱 🇨🇴


3 of 47

What’s new in the past two weeks?

Upcoming vLLM Office Hours Sessions

  • [TODAY] Intro to GuideLLM: Evaluate your LLM Deployments for Real-World Inference
  • [July 10] Break
  • [July 24] Any suggestions?

Register for all sessions here.

View previous recordings here.

vLLM Project Update

  • New community blogs
    • LLM Compressor + Axolotl
    • GuideLLM
    • llm-d


4 of 47

About vLLM

The fastest-growing de facto standard in open source model serving


Models: Llama, Granite, Mistral, DeepSeek, Qwen, Gemma

Hardware: CUDA, ROCm, Gaudi/XPU, TPU, Neuron, CPU

$ uv pip install vllm --torch-backend=auto

$ vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8


5 of 47

About vLLM

The fastest-growing de facto standard in open source model serving

Fast and easy to use open source inference server supporting:

  • All key model families
  • SOTA inference acceleration research
  • Diverse hardware backends like NVIDIA GPUs, AMD GPUs, Google TPUs, AWS Neuron, Intel Gaudi/XPU, Huawei Ascend, x86/ARM CPUs.

Full coverage of inference features:

  • Text, Multimodal, Hybrid Attention, Embeddings, Reward, Rerank
  • Quantization: INT8, FP8, INT4, FP4, KV Cache
  • CUTLASS, Triton, torch.compile
  • Chunked Prefill, Prefix Caching, Multi LoRA, Speculative Decoding, Disaggregated Prefill
  • Tool/Function Calling, Structured Outputs
  • Tensor, Pipeline, Expert, and Data Parallelism


6 of 47

Red Hat’s vLLM contributions

Red Hat is a top commercial contributor and produces enterprise-ready distributions of vLLM


Inference Kernels
  • FP8/INT8 CUTLASS scaled_mm
  • W4A16 Marlin, Machete
  • 2:4 W8A8 Sparse Linear
  • MoE Grouped GEMM

torch.compile
  • Torch Dynamo integration
  • Torch Inductor integration
  • Custom fusion passes: quantize ops fusion, GEMM<>collective fusion

System Overhead
  • v0.6.0 work: MQLLM, Async Output Processing
  • V1 work: AsyncLLM, EngineCore, Tensor/Expert/Data Parallelism

Compression Integrations
  • llm-compressor framework
  • compressed-tensors
  • W8A8, GPTQ, AWQ, HQQ in vLLM
  • Quantized model repository
  • Robust evaluations

Production Features
  • Prometheus metrics
  • Grafana dashboarding
  • Open-source integrations (structured output, tool calling, new models)


7 of 47

New vLLM blogs

The hidden cost of large language models
Compression techniques target and reduce multiple bottlenecks for inference, including compute bandwidth, memory bandwidth, and total memory footprint.
Read here.

GuideLLM: Evaluate LLM deployments for real-world inference
GuideLLM is an open source toolkit for benchmarking LLM deployment performance by simulating real-world traffic and measuring key metrics like throughput and latency.
Read here.

Axolotl meets LLM Compressor: Fast, sparse, open
Axolotl and LLM Compressor offer open source, productionized research solutions that address the core pain points in modern LLM workflows.
Read here.


8 of 47

llm-d Community Update - June 2025

We're excited to announce our new YouTube channel! We've been recording our SIG meetings and creating tutorial content to help you get started with llm-d.

Link here.

Get Involved

llm-d Community Roadmap Survey

We are looking to learn more about:

  • Your Serving Environment
  • Your Model Strategy
  • Your Performance Requirements
  • Your Future Needs

Link here.

New YouTube Channel


9 of 47

Intro to GuideLLM

Evaluate LLM Deployments for Real-World Inference

Jenny Yi

Product Manager @ Red Hat

Mark Kurtz

Member of Technical Staff @ Red Hat

Former CTO at Neural Magic (acquired)

Today’s special topic:


10 of 47

Agenda


Agenda Today

  • LLM Inference Pain Points: background on the rise of LLM adoption and production use cases
  • GuideLLM’s Mission: goals, user flows, and use cases
  • Configuring Real-World Workloads: deployments, metrics, rate types, and datasets
  • Analyzing Benchmark Results: console and other data formats
  • Architecture: components under the hood
  • What’s Next: modular components, APIs, current users, and the feature roadmap

11 of 47

Evolving LLM Adoption in the Enterprise

LLM Adoption Today

Source:

https://www.gartner.com/en/newsroom/press-releases/2025-03-31-gartner-forecasts-worldwide-genai-spending-to-reach-644-billion-in-2025

Insights from real-world LLM deployments

  • Enterprises are rapidly moving from experimentation to production in LLM deployments for use cases like RAG, chat, and code assistants.
  • As these use cases are integrated, they increasingly become integral to core workflows and user-facing products.
  • Evaluating and customizing a broader range of models becomes critical to better align with application demands and internal cost control.
  • The industry is entering a new phase of LLM adoption: one focused on making intelligent experiences accurate and scalable.

Projected GenAI spending in 2025: $644B

12 of 47

The Inference Problem

LLM Adoption Today

Common Pain Points

Delivering production-ready LLMs means navigating tradeoffs between model quality, responsiveness, and cost.

In practice, optimizing for any two hurts the third:

  • High accuracy and low latency → high cost
  • Low cost and high accuracy → high latency
  • Low cost and low latency → low accuracy

Choosing the right model, performance targets, and hardware setup is complex; without clear measurements, it’s hard to make informed decisions.


13 of 47

From Possibilities to Priorities

LLM Adoption Today

E-commerce Chatbot: fast, conversational response is key
  • TTFT ≤ 200ms for 99% of requests
  • ITL ≤ 50ms for 99% of requests

RAG System: accuracy and completeness matter more than speed
  • TTFT ≤ 300ms, ITL ≤ 100ms (if streamed)
  • Request Latency ≤ 3000ms for 99% of requests

How fast is fast enough and who decides?

  • Choosing the right model and hardware setup creates a massive search space
  • Defining key performance and quality thresholds narrows it to what’s actually usable
  • Service level objectives ensure applications stay fast, useful, and trustworthy for end users
  • Once defined, SLOs guide structured comparisons across models and hardware, enabling cost optimization


14 of 47

From Priorities to Precision

LLM Adoption Today

Why real-world data matters for accurate decision making

  • The stages of LLM inference depend on token lengths and stress the hardware differently
    • Prefill is more compute bound
    • Decode is more memory bound
  • Using unrealistic token distributions skews performance and cost estimates
  • Unrealistic distributions also impact key techniques in modern LLM serving (see the sketch below)
    • Structured generation, Speculative decoding, Prefix caching, Session caching
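As an illustration, a prefill-heavy profile and a decode-heavy profile stress the same server very differently. A minimal sketch using the --data syntax shown later in this deck (the token counts here are illustrative assumptions, not recommendations):

# Prefill-heavy workload: long prompts, short outputs (more compute bound)
$ guidellm benchmark --target "http://localhost:8000" \
    --rate-type sweep --max-seconds 30 \
    --data "prompt_tokens=4096,output_tokens=128"

# Decode-heavy workload: short prompts, long outputs (more memory bound)
$ guidellm benchmark --target "http://localhost:8000" \
    --rate-type sweep --max-seconds 30 \
    --data "prompt_tokens=128,output_tokens=1024"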


15 of 47

Evaluate and Optimize your LLM Deployments for Real-World Inference


GuideLLM is purpose-built to solve these challenges

16 of 47

Mission


17 of 47

From Guesswork to Guidance

Mission

How GuideLLM addresses the problem

GuideLLM helps you benchmark models under flexible configurations to understand performance limits and optimize for cost, latency, and quality.

  • Compare model variants to find the best fit for your application
  • Understand how hardware and scaling impact latency and cost
  • Set realistic performance SLOs based on real-world benchmarks for your data

18 of 47


User flow

  1. Model selection or customization
  2. Dataset selection or creation (synthetic data generation)
  3. Configure the workload and benchmark
  4. If the model meets your desired SLO on your hardware, deploy on vLLM!
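As a rough end-to-end sketch of these four steps, reusing the chat example shown later in this deck (the model name, port, and token counts are illustrative assumptions):

# 1-2. Pick a model and a synthetic dataset profile, then serve the model with vLLM
$ vllm serve RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16

# 3. Configure the workload and benchmark the deployment with GuideLLM
$ guidellm benchmark \
    --target "http://localhost:8000" \
    --rate-type sweep \
    --max-seconds 30 \
    --data "prompt_tokens=512,output_tokens=256"

# 4. If the reported TTFT/ITL/latency percentiles meet your SLOs, keep this vLLM configuration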

Mission


19 of 47

Use Cases

Mission

When should you use GuideLLM?

  • Pre-deployment: inform model selection, size, and family
    "With an NVIDIA H200, should I use Llama 3.1 8B or 70B Instruct to create a customer service chatbot?"
  • Hardware evaluation: compare latency and throughput across GPU/CPU instances
    "What is the max RPS that my hardware can handle before my performance begins to degrade?"
  • Regression & A/B testing: track performance deltas triggered by model updates and performance alerts
    "How much more traffic can Llama 3.1 8B Instruct FP8 handle over the baseline?"
  • Cost & capacity planning: predict infrastructure needs under projected user loads
    "How many servers do I need to keep my service running under maximum load?"

20 of 47

How does GuideLLM simulate Real-World LLM Workloads?


21 of 47

Deployment & Metrics

https://arxiv.org/abs/2407.12391

Workload Simulation

What does GuideLLM capture?

  • Supports OpenAI-compatible HTTP servers, such as vLLM and Text Generation Inference (TGI)
    • Contributions are very welcome for more support!


vLLM Example:

TGI Example:
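The example screenshots are not reproduced here; as a minimal sketch of the two setups (the ports, Docker image tag, and model name are assumptions, and exact flags may vary by GuideLLM version):

# vLLM example: serve an OpenAI-compatible API (port 8000 by default), then benchmark it
$ vllm serve meta-llama/Llama-3.1-8B-Instruct
$ guidellm benchmark --target "http://localhost:8000" \
    --rate-type sweep \
    --data "prompt_tokens=512,output_tokens=256"

# TGI example: run Text Generation Inference via Docker, then point GuideLLM at it
$ docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-8B-Instruct
$ guidellm benchmark --target "http://localhost:8080" \
    --rate-type sweep \
    --data "prompt_tokens=512,output_tokens=256"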

22 of 47

Deployment & Metrics

https://github.com/neuralmagic/guidellm/blob/main/docs/metrics.md

What does GuideLLM capture?

Statistical Summaries
  • Summary stats: mean, median, variance, …
  • Percentiles: from p001 to p999

Token Metrics
  • Prompt and output token counts

Performance Metrics
  • RPS, Request Concurrency
  • Output Tokens Per Second
  • Total Tokens Per Second
  • Request Latency
  • Time to First Token (TTFT)
  • Inter-Token Latency (ITL)
  • Time Per Output Token

Request Status Metrics
  • Successful, Incomplete, and Error Requests
  • Requests made


23 of 47


Setting up the deployment


Simulated Workload Rate Types

  • Synchronous: runs a single stream of requests, one at a time
  • Throughput: runs all requests in parallel to measure the maximum throughput of the server
  • Concurrent: runs a fixed number of request streams in parallel
  • Constant: sends requests asynchronously at a constant rate set by the user
  • Poisson: sends requests at a rate following a Poisson distribution, with the mean set by the user
  • Sweep: runs synchronous (min), throughput (max), and a series of benchmarks equally spaced between the two
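On the command line these are selected with --rate-type; the rate-based and concurrency-based types also take a rate value. The --rate flag is taken from the GuideLLM docs and does not appear on these slides, so treat it as an assumption:

# Constant: 10 requests per second
$ guidellm benchmark --target "http://localhost:8000" \
    --rate-type constant --rate 10 \
    --data "prompt_tokens=512,output_tokens=256"

# Poisson: arrivals with a mean of 10 requests per second
$ guidellm benchmark --target "http://localhost:8000" \
    --rate-type poisson --rate 10 \
    --data "prompt_tokens=512,output_tokens=256"

# Sweep: from the synchronous minimum to the throughput maximum
$ guidellm benchmark --target "http://localhost:8000" \
    --rate-type sweep \
    --data "prompt_tokens=512,output_tokens=256"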

24 of 47

Datasets

Synthetic Data: simulate the profile of traffic/requests
  • Chat: 512 prompt / 256 output tokens
  • Code Generation: 512 prompt / 512 output tokens
  • Summarization: 1024 prompt / 256 output tokens
  • RAG: 4096 prompt / 512 output tokens

Other supported sources:
  • Hugging Face Datasets (e.g., ShareGPT)
  • File-based Datasets
  • In-memory Datasets
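Using the --data syntax from the chat example on the next slide, the synthetic profiles above map roughly to the following flags (the token counts come from the table above; the only assumption is that they plug into the same --data parameter):

# Chat
--data "prompt_tokens=512,output_tokens=256"
# Code Generation
--data "prompt_tokens=512,output_tokens=512"
# Summarization
--data "prompt_tokens=1024,output_tokens=256"
# RAG
--data "prompt_tokens=4096,output_tokens=512"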

25 of 47

Configuring Workload Example

Let’s build a chatbot

Problem: "What's the maximum RPS my 1xL40 setup can handle while still meeting my SLOs?"

  • Use Case: Chat
  • Model: Llama-3.1-8B-Instruct-quantized.w4a16
  • Dataset: Synthetic
  • Dataset Profile: 512 prompt / 256 output tokens
  • Rate Type: Sweep
  • SLO: TTFT < 200ms for 99% of requests

$ vllm serve RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16

$ guidellm benchmark \
    --target "http://localhost:8000" \
    --rate-type sweep \
    --max-seconds 30 \
    --data "prompt_tokens=512,output_tokens=256"

26 of 47

How do I analyze and interpret the results?


27 of 47

Using the chat example we set up earlier, let’s run a benchmark.

While the benchmark runs, the console shows a live preview of progress and generation timing.

guidellm benchmark \
    --target "http://localhost:8000" \
    --rate-type sweep \
    --max-seconds 30 \
    --data "prompt_tokens=512,output_tokens=256"

28 of 47

Outputs

  • Overall Metadata
  • Metadata per Benchmark (rate type)
  • Comprehensive Benchmark Statistics:
    • Percentiles
    • Summary statistics


29 of 47

Analyzing results using the Console

As our target SLO is TTFT < 200ms for 99% of requests, the table shows we can sustain 10.69 RPS.

This test gives us an empirical per-node RPS limit for staying within our SLO, which can inform horizontal scaling in production.


30 of 47

Outputs

What other output formats does GuideLLM support?


  • Console
  • UI
  • .json
  • .yaml
  • .csv
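Beyond the console summary, a report file can be written by supplying an output path whose extension selects the format. The --output-path flag name is taken from the GuideLLM docs and does not appear on this slide, so treat it as an assumption:

$ guidellm benchmark --target "http://localhost:8000" \
    --rate-type sweep \
    --data "prompt_tokens=512,output_tokens=256" \
    --output-path benchmarks.json    # or benchmarks.yaml / benchmarks.csv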

31 of 47

Using .csv Output for Custom Visualizations

Useful for comparing across hardware configurations, compressions, models, etc.


32 of 47

UI Demo


33 of 47

How can we trust these results?


34 of 47

Accurate Scheduling at Any Scale

Your benchmarks reflect system performance, not framework limitations

  • Python-based toolkit compatible with popular datasets, ML pipelines, and orchestration technologies
  • Hybrid use of multiprocessing and multithreading to bypass Python GIL limits and AsyncIO bottlenecks
  • Ensures optimal use of compute for I/O bound tasks
  • Tested to 1,000+ RPS from a single node
    • 99.9% accuracy to scheduled request start times
    • <0.2% overhead with concurrent/synchronous pipelines


35 of 47

Every Metric Precisely Measured

You get complete visibility, not just a summary

  • Uses exact probability and cumulative distribution functions for every metric
  • Every measurement is kept to ensure no sampling errors from estimations
  • Full suite of statistical metrics: mean, median, mode, variance, percentiles (including tails), and more
  • All raw request-level results can be kept for deeper analysis


36 of 47

Auditable, Transparent Benchmarking

Every run is reproducible and every result is auditable

  • Deterministic pathways are used by default, such as fixed random seeds
  • Built-in performance timing and measurement throughout the system for tracking:
    • Request target start vs actual start
    • Scheduling overheads
    • HTTP overheads
    • And more
  • Optional live display of system delays and request-level timing
  • Aggregate system-level performance metrics included in all output reports


37 of 47

How can I expand on GuideLLM?


38 of 47

Architecture Diagram


39 of 47

Built to Plug in Anywhere

Flexible integration for real-world pipelines

  • Supports both CLI and Python APIs for seamless integration into notebooks, scripts, CI pipelines, and more
  • One-to-one compatibility of arguments across all entrypoints
  • Python entrypoints are modular and broken into clear stages for deeper control and customized workflows
  • Minimal setup required: just install, import, and run
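For instance, installation is a single command (package name as published on PyPI):

$ pip install guidellm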


40 of 47

Modular by Design

Custom Data, Backends, and Load Strategies

  • Extend datasets via the DatasetCreator interface (outputs HuggingFace datasets)
  • Create new SchedulingStrategy and Profile classes to simulate specific workloads, from simple ones such as Synchronous to complex ones such as Sweep
  • Customize the Backend interface for benchmarking any GenAI pathway – currently OpenAI HTTP supported
  • All request and result objects extend Pydantic models for base support
  • All designed to work with the Scheduler and Benchmarker without any further customizations


41 of 47

What next?

How you might extend GuideLLM

Use Beyond the CLI

  • Python-native: Import core modules and call benchmark functions directly

from guidellm.benchmark.run import run_benchmark

  • Composable in scripts or pipelines: great for custom workflows, notebooks, and integrations
  • E.g., build a wrapper around the benchmarking module to auto-run tests prior to model deployments

GuideLLM is open-source — try it, file a feature request or issue, or contribute your first PR!

  • Add new benchmarks, rate types, or dataset formats


42 of 47

Internal adoption, Customer 0

The Red Hat AI Model Validation program uses GuideLLM as its benchmarking tool to test models and to generate and visualize performance-expectation data for our customers.

Prior to the Red Hat acquisition, blogs and research published by Neural Magic also used GuideLLM.

Red Hat AI Hugging Face

Who’s Using it?


43 of 47

Who’s Using it?

Shout out to Charles Frye from Modal Labs for publishing this amazing interactive article on benchmarking inference engines using GuideLLM.

LLM Engineer’s Almanac


44 of 47

Future Roadmap

What we have planned…

  • Support for complex use cases like multi-turn chat and multimodal inputs (image, audio)
  • Automatic saturation detection to identify performance limits and stabilize benchmarks
  • Built-in accuracy evaluations alongside performance and cost
  • Python backend support for native integration with vLLM
  • Standardized deployment scenarios
  • Distributed benchmarking support across nodes and clusters


45 of 47

Let’s have a discussion!


  • Questions?

  • Feedback?

  • Feature requests?

  • How are you using vLLM?

  • How can we make office hours better?


46 of 47

Get involved with the vLLM Community

Contribute to key vLLM features: Comment and review PRs that are interesting to you. Join the discussion on RFCs. Check out “good first issue” tags.

Give Us Feedback: We’ll email you today’s recording as soon as it’s ready. Respond and tell us what we are doing right and what we can do better with vLLM office hours. Or comment on this slide!

Join vLLM Developer Slack: Ask questions and engage with us via Slack. Join here.

Join Red Hat’s vLLM Mission: Red Hat wants to bring open-source LLMs and vLLM to every enterprise on the planet. We are looking for vLLM Engineers to help us accomplish our mission. Apply here.



47 of 47

Thank you, and see you in two weeks!

Michael Goin

Principal Software Engineer, Red Hat

vLLM Committer

Jenny Yi

Product Manager, Red Hat

Mark Kurtz

Technical Staff Member, Red Hat
