1 of 47

vLLM Office Hours #28

Special Topic: Intro to GuideLLM: Evaluate your LLM Deployments for Real-World Inference

Michael Goin

Principal Software Engineer, Red Hat

vLLM Committer

Jenny Yi

Product Manager, Red Hat

Mark Kurtz

Technical Staff Member, Red Hat

Former CTO Neural Magic


2 of 47

Welcome!

A few housekeeping items before we start…

  • Let’s keep it interactive - speak up and engage in chat
  • This session is being recorded and live streamed
    • Find the recording on YouTube
    • Ask follow-up questions in the vLLM Developer Slack
    • Help us spread the word by sharing the live posts with your networks


Where are you dialing in from?

🇧🇷 🇮🇳 🇺🇸 🇨🇳 🇬🇧 🇨🇦 🇫🇷 🇩🇪 🇲🇽 🇵🇱 🇪🇦 🇭🇷 🇨🇭 🇬🇷 🇰🇷 🇮🇱 🇦🇴 🏴󠁧󠁢󠁳󠁣󠁴󠁿 🇦🇪 🇵🇰 🇷🇸 🇹🇷 🇷🇴 🇱🇻 🇳🇿 🇸🇪 🇱🇻 🇮🇪 🇮🇹 🇩🇰 🇵🇹 🇺🇦 🇦🇷 🇳🇱 🇫🇮 🇨🇱 🇨🇴


3 of 47

What’s new in the past two weeks?

Upcoming vLLM Office Hours Sessions

  • [TODAY] Intro to GuideLLM: Evaluate your LLM Deployments for Real-World Inference
  • [July 10] Break
  • [July 24] Any suggestions?

Register for all sessions here.

View previous recordings here.

vLLM Project Update

  • New community blogs
    • LLM Compressor + Axolotl
    • GuideLLM
    • llm-d


4 of 47

About vLLM

The fastest-growing de facto standard in open source model serving


Models: Llama, Granite, Mistral, DeepSeek, Qwen, Gemma

Hardware: CUDA, ROCm, Gaudi/XPU, TPU, Neuron, CPU

$ uv pip install vllm --torch-backend=auto

$ vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8


5 of 47

About vLLM

The fastest-growing de facto standard in open source model serving

Fast and easy to use open source inference server supporting:

  • All key model families
  • SOTA inference acceleration research
  • Diverse hardware backends like NVIDIA GPUs, AMD GPUs, Google TPUs, AWS Neuron, Intel Gaudi/XPU, Huawei Ascend, x86/ARM CPUs.

Full coverage of inference features:

  • Text, Multimodal, Hybrid Attention, Embeddings, Reward, Rerank
  • Quantization: INT8, FP8, INT4, FP4, KV Cache
  • CUTLASS, Triton, torch.compile
  • Chunked Prefill, Prefix Caching, Multi LoRA, Speculative Decoding, Disaggregated Prefill
  • Tool/Function Calling, Structured Outputs
  • Tensor, Pipeline, Expert, and Data Parallelism


6 of 47

Red Hat’s vLLM contributions

Red Hat is a top commercial contributor and produces enterprise-ready distributions of vLLM


Inference Kernels
  • FP8/INT8 CUTLASS scaled_mm
  • W4A16 Marlin, Machete
  • 2:4 W8A8 Sparse Linear
  • MoE Grouped GEMM

torch.compile
  • Torch Dynamo integration
  • Torch Inductor integration
  • Custom fusion passes: quantize ops fusion, GEMM<>collective fusion

System Overhead
  • v0.6.0 work: MQLLM, Async Output Processing
  • V1 work: AsyncLLM, EngineCore, Tensor/Expert/Data Parallelism

Compression Integrations
  • llm-compressor framework
  • compressed-tensors
  • W8A8, GPTQ, AWQ, HQQ in vLLM
  • Quantized model repository
  • Robust evaluations

Production Features
  • Prometheus metrics
  • Grafana dashboarding
  • Open-source integrations (structured output, tool calling, new models)


7 of 47

New vLLM blogs

The hidden cost of large language models
Compression techniques target and reduce multiple bottlenecks for inference, including compute bandwidth, memory bandwidth, and total memory footprint.
Read here.

GuideLLM: Evaluate LLM deployments for real-world inference
GuideLLM is an open source toolkit for benchmarking LLM deployment performance by simulating real-world traffic and measuring key metrics like throughput and latency.
Read here.

Axolotl meets LLM Compressor: Fast, sparse, open
Axolotl and LLM Compressor offer open source, productionized research solutions that address the core pain points in modern LLM workflows.
Read here.


8 of 47

llm-d Community Update - June 2025

We're excited to announce our new YouTube channel! We've been recording our SIG meetings and creating tutorial content to help you get started with llm-d.

Link here.

Get Involved

llm-d Community Roadmap Survey

We are looking to learn more about:

  • Your Serving Environment
  • Your Model Strategy
  • Your Performance Requirements
  • Your Future Needs

Link here.

New YouTube Channel


9 of 47

Intro to GuideLLM

Evaluate LLM Deployments for Real-World Inference

Jenny Yi

Product Manager @ Red Hat

Mark Kurtz

Member of Technical Staff @ Red Hat

Former CTO at Neural Magic (acquired)

Today’s special topic:


10 of 47

Agenda


Agenda Today

  • LLM Inference Pain Points: background on the rise of LLM adoption and production use cases
  • GuideLLM’s Mission: goals, user flows, and use cases
  • Configuring Real-World Workloads: deployments, metrics, rate types, and datasets
  • Analyzing Benchmark Results: console and other data formats
  • Architecture: components under the hood
  • What’s Next: modular components, APIs, current users, and the feature roadmap

11 of 47

Evolving LLM Adoption in the Enterprise

LLM Adoption Today

Source:

https://www.gartner.com/en/newsroom/press-releases/2025-03-31-gartner-forecasts-worldwide-genai-spending-to-reach-644-billion-in-2025

Insights from real-world LLM deployments

  • Enterprises are rapidly moving from experimentation to production in LLM deployments for use cases like RAG, chat, and code assistants.
  • As these use cases are integrated, they increasingly become integral to core workflows and user-facing products.
  • Evaluating and customizing a broader range of models becomes critical to better align with application demands and internal cost control.
  • The industry is entering a new phase of LLM adoption: one focused on making intelligent experiences accurate and scalable.

Projected GenAI spending in 2025: $644B

12 of 47

The Inference Problem

LLM Adoption Today

Common Pain Points

Delivering production-ready LLMs means navigating tradeoffs between model quality, responsiveness, and cost.

In practice, optimizing for any two hurts the third:

  • High accuracy and low latency → high cost
  • Low cost and high accuracy → high latency
  • Low cost and low latency → low accuracy

Choosing the right model, performance targets, and hardware setup is complex; without clear measurements, it’s hard to make informed decisions.


13 of 47

From Possibilities to Priorities

LLM Adoption Today

E-commerce Chatbot: fast, conversational response is key
  • TTFT ≤ 200ms for 99% of requests
  • ITL ≤ 50ms for 99% of requests

RAG System: accuracy and completeness matter more than speed
  • TTFT ≤ 300ms, ITL ≤ 100ms (if streamed)
  • Request Latency ≤ 3000ms for 99% of requests

How fast is fast enough and who decides?

  • Choosing the right model and hardware setup creates a massive search space
  • Defining key performance and quality thresholds narrows it to what’s actually usable
  • Service level objectives ensure applications stay fast, useful, and trustworthy for end users
  • Once defined, SLOs guide structured comparisons across models and hardware, enabling cost optimization


14 of 47

From Priorities to Precision

LLM Adoption Today

Why real-world data matters for accurate decision making

  • The stages of LLM inference depend on token lengths and stress the hardware differently
    • Prefill is more compute bound
    • Decode is more memory bound
  • Using unrealistic token distributions skews performance and cost estimates
  • Unrealistic distributions also impact key techniques in modern LLM serving (see the sketch below)
    • Structured generation, Speculative decoding, Prefix caching, Session caching
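As an illustration, a prefill-heavy profile and a decode-heavy profile stress the same server very differently. A minimal sketch using the --data syntax shown later in this deck (the token counts here are illustrative assumptions, not recommendations):

# Prefill-heavy workload: long prompts, short outputs (more compute bound)
$ guidellm benchmark --target "http://localhost:8000" \
    --rate-type sweep --max-seconds 30 \
    --data "prompt_tokens=4096,output_tokens=128"

# Decode-heavy workload: short prompts, long outputs (more memory bound)
$ guidellm benchmark --target "http://localhost:8000" \
    --rate-type sweep --max-seconds 30 \
    --data "prompt_tokens=128,output_tokens=1024"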


15 of 47

Evaluate and Optimize your LLM Deployments for Real-World Inference


GuideLLM is purpose-built to solve these challenges

16 of 47

Mission


17 of 47

From Guesswork to Guidance

Mission

How GuideLLM addresses the problem

GuideLLM helps you benchmark models under flexible configurations to understand performance limits and optimize for cost, latency, and quality.

  • Compare model variants to find the best fit for your application
  • Understand how hardware and scaling impact latency and cost
  • Set realistic performance SLOs based on real-world benchmarks for your data

18 of 47


User flow

  1. Model selection or customization
  2. Dataset selection or creation (synthetic data generation)
  3. Configure the workload and benchmark
  4. If the model meets your desired SLO on your hardware, deploy on vLLM!
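As a rough end-to-end sketch of these four steps, reusing the chat example shown later in this deck (the model name, port, and token counts are illustrative assumptions):

# 1-2. Pick a model and a synthetic dataset profile, then serve the model with vLLM
$ vllm serve RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16

# 3. Configure the workload and benchmark the deployment with GuideLLM
$ guidellm benchmark \
    --target "http://localhost:8000" \
    --rate-type sweep \
    --max-seconds 30 \
    --data "prompt_tokens=512,output_tokens=256"

# 4. If the reported TTFT/ITL/latency percentiles meet your SLOs, keep this vLLM configuration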

Mission


19 of 47

Use Cases

Mission

When should you use GuideLLM?

  • Pre-deployment: inform model selection, size, and family
    "With an NVIDIA H200, should I use Llama 3.1 8B or 70B Instruct to create a customer service chatbot?"
  • Hardware evaluation: compare latency and throughput across GPU/CPU instances
    "What is the max RPS that my hardware can handle before my performance begins to degrade?"
  • Regression & A/B testing: track performance deltas triggered by model updates and performance alerts
    "How much more traffic can Llama 3.1 8B Instruct FP8 handle over the baseline?"
  • Cost & capacity planning: predict infrastructure needs under projected user loads
    "How many servers do I need to keep my service running under maximum load?"

20 of 47

How does GuideLLM simulate Real-World LLM Workloads?


21 of 47

Deployment & Metrics

https://arxiv.org/abs/2407.12391

Workload Simulation

What does GuideLLM capture?

  • Supports OpenAI-compatible HTTP servers, such as vLLM and Text Generation Inference (TGI)
    • Contributions are very welcome for more support!


vLLM Example:

TGI Example:
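The example screenshots are not reproduced here; as a minimal sketch of the two setups (the ports, Docker image tag, and model name are assumptions, and exact flags may vary by GuideLLM version):

# vLLM example: serve an OpenAI-compatible API (port 8000 by default), then benchmark it
$ vllm serve meta-llama/Llama-3.1-8B-Instruct
$ guidellm benchmark --target "http://localhost:8000" \
    --rate-type sweep \
    --data "prompt_tokens=512,output_tokens=256"

# TGI example: run Text Generation Inference via Docker, then point GuideLLM at it
$ docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-8B-Instruct
$ guidellm benchmark --target "http://localhost:8080" \
    --rate-type sweep \
    --data "prompt_tokens=512,output_tokens=256"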

22 of 47

Deployment & Metrics

https://github.com/neuralmagic/guidellm/blob/main/docs/metrics.md

What does GuideLLM capture?

Statistical Summaries
  • Summary stats: mean, median, variance, …
  • Percentiles: from p001 to p999

Token Metrics
  • Prompt and output token counts

Performance Metrics
  • RPS, Request Concurrency
  • Output Tokens Per Second
  • Total Tokens Per Second
  • Request Latency
  • Time to First Token (TTFT)
  • Inter-Token Latency (ITL)
  • Time Per Output Token

Request Status Metrics
  • Successful, Incomplete, and Error Requests
  • Requests made


23 of 47


Setting up the deployment


Simulated Workload Rate Types

  • Synchronous: runs a single stream of requests, one at a time
  • Throughput: runs all requests in parallel to measure the maximum throughput of the server
  • Concurrent: runs a fixed number of request streams in parallel
  • Constant: sends requests asynchronously at a constant rate set by the user
  • Poisson: sends requests at a rate following a Poisson distribution, with the mean set by the user
  • Sweep: runs synchronous (min), throughput (max), and a series of benchmarks equally spaced between the two
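On the command line these are selected with --rate-type; the rate-based and concurrency-based types also take a rate value. The --rate flag is taken from the GuideLLM docs and does not appear on these slides, so treat it as an assumption:

# Constant: 10 requests per second
$ guidellm benchmark --target "http://localhost:8000" \
    --rate-type constant --rate 10 \
    --data "prompt_tokens=512,output_tokens=256"

# Poisson: arrivals with a mean of 10 requests per second
$ guidellm benchmark --target "http://localhost:8000" \
    --rate-type poisson --rate 10 \
    --data "prompt_tokens=512,output_tokens=256"

# Sweep: from the synchronous minimum to the throughput maximum
$ guidellm benchmark --target "http://localhost:8000" \
    --rate-type sweep \
    --data "prompt_tokens=512,output_tokens=256"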

24 of 47

Datasets

Synthetic Data: simulate the profile of traffic/requests
  • Chat: 512 prompt / 256 output tokens
  • Code Generation: 512 prompt / 512 output tokens
  • Summarization: 1024 prompt / 256 output tokens
  • RAG: 4096 prompt / 512 output tokens

Other supported sources:
  • Hugging Face Datasets (e.g., ShareGPT)
  • File-based Datasets
  • In-memory Datasets
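Using the --data syntax from the chat example on the next slide, the synthetic profiles above map roughly to the following flags (the token counts come from the table above; the only assumption is that they plug into the same --data parameter):

# Chat
--data "prompt_tokens=512,output_tokens=256"
# Code Generation
--data "prompt_tokens=512,output_tokens=512"
# Summarization
--data "prompt_tokens=1024,output_tokens=256"
# RAG
--data "prompt_tokens=4096,output_tokens=512"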

25 of 47

Configuring Workload Example

Let’s build a chatbot

Problem: "What's the maximum RPS my 1xL40 setup can handle while still meeting my SLOs?"

  • Use Case: Chat
  • Model: Llama-3.1-8B-Instruct-quantized.w4a16
  • Dataset: Synthetic
  • Dataset Profile: 512 prompt / 256 output tokens
  • Rate Type: Sweep
  • SLO: TTFT < 200ms for 99% of requests

$ vllm serve RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16

$ guidellm benchmark \
    --target "http://localhost:8000" \
    --rate-type sweep \
    --max-seconds 30 \
    --data "prompt_tokens=512,output_tokens=256"

26 of 47

How do I analyze and interpret the results?


27 of 47

Using the chat example we set up earlier, let’s run a benchmark.

While the benchmark runs, the console shows a live preview of progress and generation timing.

guidellm benchmark \
    --target "http://localhost:8000" \
    --rate-type sweep \
    --max-seconds 30 \
    --data "prompt_tokens=512,output_tokens=256"

28 of 47

Outputs

  • Overall Metadata
  • Metadata per Benchmark (rate type)
  • Comprehensive Benchmark Statistics:
    • Percentiles
    • Summary statistics


29 of 47

Analyzing results using the Console

As our target SLO is TTFT < 200ms for 99% of requests, the table shows we can sustain 10.69 RPS.

This test gives us an empirical per-node RPS limit for staying within our SLO, which can inform horizontal scaling in production.


30 of 47

Outputs

What other output formats does GuideLLM support?


  • Console
  • UI
  • .json
  • .yaml
  • .csv
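Beyond the console summary, a report file can be written by supplying an output path whose extension selects the format. The --output-path flag name is taken from the GuideLLM docs and does not appear on this slide, so treat it as an assumption:

$ guidellm benchmark --target "http://localhost:8000" \
    --rate-type sweep \
    --data "prompt_tokens=512,output_tokens=256" \
    --output-path benchmarks.json    # or benchmarks.yaml / benchmarks.csv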

31 of 47

Using .csv Output for Custom Visualizations

Useful for comparing across hardware configurations, compressions, models, etc.


32 of 47

UI Demo


33 of 47

How can we trust these results?


34 of 47

Accurate Scheduling at Any Scale

Your benchmarks reflect system performance, not framework limitations

  • Python-based toolkit compatible with popular datasets, ML pipelines, and orchestration technologies
  • Hybrid use of multiprocessing and multithreading to bypass Python GIL limits and AsyncIO bottlenecks
  • Ensures optimal use of compute for I/O bound tasks
  • Tested to 1,000+ RPS from a single node
    • 99.9% accuracy to scheduled request start times
    • <0.2% overhead with concurrent/synchronous pipelines


35 of 47

Every Metric Precisely Measured

You get complete visibility, not just a summary

  • Uses exact probability and cumulative distribution functions for every metric
  • Every measurement is kept to ensure no sampling errors from estimations
  • Full suite of statistical metrics: mean, median, mode, variance, percentiles (including tails), and more
  • All raw request-level results can be kept for deeper analysis


36 of 47

Auditable, Transparent Benchmarking

Every run is reproducible and every result is auditable

  • Deterministic pathways are used by default, such as fixed random seeds
  • Built-in performance timing and measurement throughout the system for tracking:
    • Request target start vs actual start
    • Scheduling overheads
    • HTTP overheads
    • And more
  • Optional live display of system delays and request-level timing
  • Aggregate system-level performance metrics included in all output reports


37 of 47

How can I expand on GuideLLM?


38 of 47

Architecture Diagram


39 of 47

Built to Plug in Anywhere

Flexible integration for real-world pipelines

  • Supports both CLI and Python APIs for seamless integration into notebooks, scripts, CI pipelines, and more
  • One-to-one compatibility of arguments across all entrypoints
  • Python entrypoints are modular and broken into clear stages for deeper control and customized workflows
  • Minimal setup required: just install, import, and run
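For instance, installation is a single command (package name as published on PyPI):

$ pip install guidellm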


40 of 47

Modular by Design

Custom Data, Backends, and Load Strategies

  • Extend datasets via the DatasetCreator interface (outputs HuggingFace datasets)
  • Create new SchedulingStrategy and Profile classes to simulate specific workloads, from simple ones such as Synchronous to complex ones such as Sweep
  • Customize the Backend interface for benchmarking any GenAI pathway – currently OpenAI HTTP supported
  • All request and result objects extend Pydantic models for base support
  • All designed to work with the Scheduler and Benchmarker without any further customizations


41 of 47

What next?

How you might extend GuideLLM

Use Beyond the CLI

  • Python-native: Import core modules and call benchmark functions directly

from guidellm.benchmark.run import run_benchmark

  • Composable in scripts or pipelines: great for custom workflows, notebooks, and integrations
  • E.g., build a wrapper around the benchmarking module to auto-run tests prior to model deployments

GuideLLM is open-source — try it, file a feature request or issue, or contribute your first PR!

  • Add new benchmarks, rate types, or dataset formats


42 of 47

Internal adoption, Customer 0

The Red Hat AI Model Validation program uses GuideLLM as its benchmarking tool to test models and to generate and visualize performance-expectation data for our customers.

Prior to the Red Hat acquisition, blogs and research published by Neural Magic also used GuideLLM.

Red Hat AI Hugging Face

Who’s Using it?


43 of 47

Who’s Using it?

Shout out to Charles Frye from Modal Labs for publishing this amazing interactive article on benchmarking inference engines using GuideLLM.

LLM Engineer’s Almanac


44 of 47

Future Roadmap

What we have planned…

  • Support for complex use cases like multi-turn chat and multimodal inputs (image, audio)
  • Automatic saturation detection to identify performance limits and stabilize benchmarks
  • Built-in accuracy evaluations alongside performance and cost
  • Python backend support for native integration with vLLM
  • Standardized deployment scenarios
  • Distributed benchmarking support across nodes and clusters


45 of 47

Let’s have a discussion!


  • Questions?

  • Feedback?

  • Feature requests?

  • How are you using vLLM?

  • How can we make office hours better?


46 of 47

Get involved with the vLLM Community

Contribute to key vLLM features: Comment and review PRs that are interesting to you. Join the discussion on RFCs. Check out “good first issue” tags.

Give Us Feedback: We’ll email you today’s recording as soon as it’s ready. Respond and tell us what we are doing right and what we can do better with vLLM office hours. Or comment on this slide!

Join vLLM Developer Slack: Ask questions and engage with us via Slack. Join here.

Join Red Hat’s vLLM Mission: Red Hat wants to bring open-source LLMs and vLLM to every enterprise on the planet. We are looking for vLLM Engineers to help us accomplish our mission. Apply here.



47 of 47

Thank you, and see you in two weeks!

Michael Goin

Principal Software Engineer, Red Hat

vLLM Committer

Jenny Yi

Product Manager, Red Hat

Mark Kurtz

Technical Staff Member, Red Hat
