1 of 31

KServe Next:

Advancing Generative AI Model Serving

Dan Sun, Engineering Team Lead, Bloomberg;

Co-Founder, KServe

Yuan Tang, Senior Principal Software Engineer, Red Hat;

Project Co-Lead, KServe

#KubeCon #CloudNativeCon

2 of 31

Model inference platform

Supported runtimes

Orchestration

Hardware accelerators (GPUs, CPUs, etc.)

Cloud native integrations

Autoscaling

Networking

Generative

Predictive

GenAI integrations

3 of 31

Our Journey

4 of 31

Our Journey

2019:

Developed collaboratively by Google, IBM, Bloomberg, NVIDIA, and Seldon under the Kubeflow project as KFServing.

5 of 31

Our Journey

2021:

Rebranded from KFServing to the standalone KServe project.

6 of 31

Our Journey

2022: KServe joined LF AI & Data Foundation.

7 of 31

Our Journey

Sept 2025: KServe was accepted as a CNCF incubating project!

8 of 31

What is KServe?

9 of 31

Assembling AI with KServe

KServe is a platform that unifies Generative and Predictive AI Inference on Kubernetes. It is simple enough for quick deployments, yet powerful enough to handle enterprise-scale AI workloads with advanced features.

10 of 31

KServe Predictive AI Inference

Diagram from Alexa Griffith

11 of 31

AI Inference Traffic 🚦

| Feature | Predictive AI Traffic | Generative AI Traffic |
| --- | --- | --- |
| Task | Classification, regression | Generate new data: text, image, code |
| Input traffic | Small, structured, high volume | Variable, text/multi-modal; can be bursty |
| Output traffic | Small: a single label or score | Large, streaming, high bandwidth |
| Latency | Ultra-low latency | Moderate to high latency |
| Computation / scaling | CPU, or GPU (e.g. L4) for transformer models; horizontal scaling to maximize QPS | Heavy computation on H100/H200 with high memory bandwidth; efficient GPU/VRAM utilization |

12 of 31

KServe GenAI: Advanced Building Blocks for LLMs

LLM Metric-Based Autoscaling

Prompt Caching

Intelligent Routing, Traffic Management

GenAI Runtime

vLLM, TRT-LLM, llm-d

Scale

Cost

Latency / Throughput

Efficiency

13 of 31

KServe GenAI Architecture

14 of 31

Highlighted Features

15 of 31

Why Concurrency-Based Autoscaling Fails for LLMs

LLMs run on expensive GPUs; an idle model is a massive waste of money

Default KServe autoscaler uses in-flight concurrency: a request counts as concurrent from the moment it is received until the last token is sent

Why concurrency fails: the bottleneck is GPU VRAM for the KV cache. A single long prompt can fill VRAM, but the Knative Pod Autoscaler (KPA) only sees concurrency = 1
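A back-of-the-envelope calculation makes the point concrete. This is a minimal sketch, not KServe code; the model shape is an assumed Llama-3-70B-like configuration (80 layers, 8 KV heads with GQA, head_dim 128, fp16 KV cache):

```python
# Illustrative arithmetic: why one long request can exhaust VRAM while
# request concurrency stays at 1. Model shape is an assumption, not a
# KServe default.

def kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V each store layers * kv_heads * head_dim elements per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token()        # 327,680 bytes ~= 320 KiB/token
long_prompt_tokens = 32_768
gib = per_token * long_prompt_tokens / 2**30  # 10.0 GiB for a single request

print(f"{per_token} bytes/token, {gib:.1f} GiB for one 32k-token request")
```

One 32k-token request ties up roughly 10 GiB of KV cache on this model shape, yet a concurrency-based autoscaler counts it the same as a 10-token request.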

16 of 31

KServe Metric-Based Autoscaler

GPU Utilization %

Measures compute load; a useful secondary signal for scaling decisions

KV Cache VRAM %

When VRAM is full, the model can’t accept more requests; we must scale before hitting 100%

predictor:
  model:
    modelFormat:
      name: huggingface
    storageUri: "hf://Qwen2.5"
  minReplicas: 1
  maxReplicas: 5
  autoScaling:
    metrics:
      - type: PodMetric
        podmetric:
          metric:
            query: "vllm:kv_cache_usage_percent"
          target:
            type: Value
            value: "75"

17 of 31

Envoy AI Gateway: Token Rate Limiting

Diagram from Envoy AI Gateway website

Rate Limit for LLMs

Token usage extraction from response body

The “cost” is stored in Envoy dynamic metadata
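The extraction step can be illustrated with a short Python sketch. This is a hypothetical helper, not Envoy AI Gateway code; it only shows the idea of pulling the token cost out of an OpenAI-style response body:

```python
# Hedged sketch: extract the token "cost" from an OpenAI-style chat
# completion response. In the gateway this value would then be written to
# Envoy dynamic metadata and fed into the rate limiter.
import json

def extract_token_cost(response_body: str) -> int:
    """Return total tokens consumed, or 0 if the body carries no usage."""
    body = json.loads(response_body)
    usage = body.get("usage") or {}
    return usage.get("total_tokens", 0)

resp = json.dumps({
    "choices": [{"message": {"role": "assistant", "content": "hi"}}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
})
print(extract_token_cost(resp))  # 15
```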

18 of 31

Envoy AI Gateway: Inference Extension

Traditional load balancing falls short for LLM inference workloads

Intelligent endpoint picking using KV cache usage, LoRA adapter placement, and LLM load information

Diagram by Erica from Envoy AI Gateway website

19 of 31

GenAI Feature: llm-d Integration

An open-source framework for distributed large language model (LLM) inference that runs natively on Kubernetes

https://github.com/llm-d/llm-d

20 of 31

GenAI Feature: llm-d Integration

Intelligent Inference Scheduler

Prefix Caching

P/D Disaggregated Serving

Variant Autoscaling

21 of 31

LLM Inference Workload

Diagram by Jooho Lee from KServe website

Single-Node/Multi-GPU:

SLM (less than 70B)

Multi-Node/Multi-GPU: Distributed Inference for large models

Disaggregated Prefill/Decode: High Throughput Requirement

22 of 31

LLM Inference Control Plane API

Diagram by Jooho Lee from KServe website

Model Spec: model URL, LoRA

Router Spec: Gateway API, scheduler

Workload Spec: single/multi-node, prefill/decode

Parallelism Spec: TP/DP/EP
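Put together, the four spec sections might look roughly like the hedged YAML sketch below. The field names and values are illustrative only, not the verbatim LLMInferenceService schema; consult the KServe docs for the actual API:

```yaml
# Hedged sketch of an LLM inference control-plane resource.
# Field names are illustrative of the four spec sections, not exact.
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-70b
spec:
  model:              # Model Spec: model URL, LoRA
    uri: hf://meta-llama/Llama-3.3-70B-Instruct
  router: {}          # Router Spec: Gateway API, scheduler (defaults)
  prefill:            # Workload Spec: prefill/decode disaggregation
    replicas: 1
  worker:             # Workload Spec: decode workers, single/multi-node
    replicas: 2
  parallelism:        # Parallelism Spec: TP/DP/EP
    tensor: 4
```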

23 of 31

LLM Inference Data Plane

Diagram by Jooho Lee from KServe website

Intelligent Endpoint Picking

  • Prefix Cache Routing
  • Prefill-Decode Pool Routing
  • Load-Aware Routing
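As a toy illustration of load-aware endpoint picking (the names, scores, and weights here are invented for the sketch; this is not the actual scheduler code):

```python
# Toy endpoint picker: prefer replicas with free KV-cache headroom, a
# shallow queue, and an existing prefix-cache hit. Weights are invented.
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    kv_cache_usage: float   # fraction of KV-cache VRAM in use, 0.0-1.0
    prefix_cache_hit: bool  # replica already holds the prompt prefix
    queued_requests: int

def score(ep: Endpoint) -> float:
    # Lower is better: penalize VRAM pressure and queue depth,
    # reward a prefix-cache hit (saves recomputing the prefill).
    s = ep.kv_cache_usage + 0.1 * ep.queued_requests
    if ep.prefix_cache_hit:
        s -= 0.5
    return s

def pick(endpoints):
    return min(endpoints, key=score)

pool = [
    Endpoint("pod-a", kv_cache_usage=0.90, prefix_cache_hit=False, queued_requests=4),
    Endpoint("pod-b", kv_cache_usage=0.60, prefix_cache_hit=True,  queued_requests=2),
    Endpoint("pod-c", kv_cache_usage=0.40, prefix_cache_hit=False, queued_requests=0),
]
print(pick(pool).name)  # pod-b: busier than pod-c, but the prefix hit wins
```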

24 of 31

Roadmap

25 of 31

Unified Inference Fabric

Core Objectives: Unified Inference Fabric

  • Centralized platform for serving open-source and fine-tuned models
  • Standardized APIs and telemetry for consistency
  • Fully managed inference experience with optimized performance and efficient compute utilization
  • Secure identity for agent-to-LLM traffic with SPIFFE/SPIRE

26 of 31

KServe: Gateway API to Unify Inference Workload

27 of 31

Community

28 of 31

19 Maintainers and 300+ Contributors!

29 of 31

Trusted by industry leaders

KServe is used in production by organizations across various industries, providing reliable model inference at scale.

30 of 31

Join Our Community

  • Repo: https://github.com/kserve/kserve
  • Website: https://kserve.github.io
  • Biweekly community meetings on Thursdays at 9 AM PST
  • #kserve and #kserve-contributors channels in the CNCF Slack


31 of 31