KServe Next:
Advancing Generative AI Model Serving
Dan Sun, Engineering Team Lead, Bloomberg;
Co-Founder, KServe
Yuan Tang, Senior Principal Software Engineer, Red Hat;
Project Co-Lead, KServe
#KubeCon #CloudNativeCon
Model inference platform
Supported runtimes
Orchestration
Hardware accelerators (GPUs, CPUs, etc.)
Cloud native integrations
Autoscaling
Networking
Generative
Predictive
GenAI integrations
Our Journey
2019:
Developed collaboratively by Google, IBM, Bloomberg, NVIDIA, and Seldon under the Kubeflow project as KFServing.
2021:
Rebranded from KFServing to standalone KServe project.
2022: KServe joined LF AI & Data Foundation.
Sept 2025: KServe was accepted as a CNCF incubating project!
What is KServe?
Assembling AI with KServe
KServe is a platform that unifies Generative and Predictive AI Inference on Kubernetes. It is simple enough for quick deployments, yet powerful enough to handle enterprise-scale AI workloads with advanced features.
AI Inference Traffic 🚦
| Feature | Predictive AI Traffic | Generative AI Traffic |
|---|---|---|
| Task | Classification, regression | Generate new data: text, images, code |
| Input traffic | Small, structured, high volume | Variable, text/multi-modal; traffic can be bursty |
| Output traffic | Small, a single label or score | Large, streaming, high bandwidth |
| Latency | Ultra-low latency | Moderate to high latency |
| Computation / scaling | CPU, or GPU (e.g., L4) for transformer models; horizontal scaling to maximize QPS | Heavy computation on H100/H200 with high memory bandwidth; efficient GPU/VRAM utilization |
KServe GenAI: Advanced Building Blocks for LLMs
LLM Metric-Based Autoscaling
Prompt Caching
Intelligent Routing, Traffic Management
GenAI Runtime
vLLM, TRT-LLM, llm-d
Scale
Cost
Latency /
Throughput
Efficiency
KServe GenAI Architecture
Highlighted
Features
Why Does Concurrency Fail for LLMs?
LLMs run on expensive GPUs; an idle model is a massive waste of money
The bottleneck is GPU VRAM for the KV cache
A single long prompt can fill VRAM, but the KPA (Knative Pod Autoscaler) only sees concurrency = 1
Default KServe Autoscaler
In-flight concurrency
A request is concurrent from the moment it is received until the last token is sent
KServe Metric-Based Autoscaler
GPU Utilization %
Measures compute load, which is a useful secondary signal
KV Cache VRAM %
When VRAM is full, the model can’t accept more requests; we must scale before hitting 100%
```yaml
predictor:
  model:
    modelFormat:
      name: huggingface
    storageUri: "hf://Qwen2.5"
  minReplicas: 1
  maxReplicas: 5
  autoScaling:
    metrics:
      - type: PodMetric
        podmetric:
          metric:
            query: "vllm:kv_cache_usage_percent"
          target:
            type: Value
            value: "75"
```
Envoy AI Gateway: Token Rate Limiting
Diagram from Envoy AI Gateway website
Rate Limiting for LLMs
Token usage is extracted from the response body
The token “cost” is stored in Envoy dynamic metadata
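To make this concrete, the following is a hedged sketch of how token-based rate limiting can be wired up: an AIGatewayRoute declares which token count to record as a request cost in dynamic metadata, and a rate-limit policy charges that cost against a per-user token budget. Resource names, the metadata key, and the exact field layout are illustrative assumptions and may differ across Envoy AI Gateway versions.

```yaml
# Illustrative sketch (names and fields are assumptions, not a verbatim spec).
# 1) Record total tokens used per request into Envoy dynamic metadata.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llm-route
spec:
  llmRequestCosts:
    - metadataKey: llm_total_token   # key under which the cost is stored
      type: TotalToken               # prompt + completion tokens
---
# 2) Rate-limit on that cost instead of on request count.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: llm-token-limit
spec:
  rateLimit:
    type: Global
    global:
      rules:
        - limit:
            requests: 100000   # token budget per user per hour
            unit: Hour
          cost:
            response:
              from: Metadata   # charge the token count recorded above
              metadata:
                key: llm_total_token
```

The key idea from the slide: because the budget is charged after the response body is parsed, one request that generates 5,000 tokens consumes 5,000 units of the budget, not 1.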
Envoy AI Gateway: Inference Extension
Traditional load balancing falls short for LLM inference workloads
Intelligent endpoint picking with KV cache usage, LoRA adapter info and LLM load information
Diagram by Erica from Envoy AI Gateway website
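As a rough sketch of the Gateway API Inference Extension side, an InferencePool groups the model-server pods and delegates endpoint selection to an external endpoint-picker service, which can weigh KV cache usage, LoRA adapter placement, and queue depth. The pool name, selector labels, and picker service name below are assumptions for illustration.

```yaml
# Illustrative sketch, assuming the v1alpha2 Inference Extension API.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-pool              # hypothetical pool name
spec:
  selector:
    app: vllm                  # matches the vLLM serving pods
  targetPortNumber: 8000       # port the model servers listen on
  extensionRef:
    name: vllm-endpoint-picker # endpoint-picker service doing the smart routing
```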
GenAI Feature: llm-d Integration
An open-source framework for distributed large language model (LLM) inference that runs natively on Kubernetes
GenAI Feature: llm-d Integration
Intelligent Inference Scheduler
Prefix Caching
P/D Disaggregated Serving
Variant Autoscaling
LLM Inference Workload
Single-Node/Multi-GPU:
Small language models (SLMs, less than 70B parameters)
Multi-Node/Multi-GPU: Distributed Inference for large models
Disaggregated Prefill/Decode: High Throughput Requirement
LLM Inference Control Plane API
Model Spec: model URL, LoRA adapters
Router Spec: Gateway API, scheduler
Workload Spec: single/multi-node, prefill/decode
Parallelism Spec: TP/DP/EP
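Putting the four spec groups above together, a control-plane resource might look like the following. This is a hypothetical sketch only: the resource kind, API version, and every field name are assumptions used to illustrate the shape of the API, not the released schema.

```yaml
# Hypothetical sketch; all field names are illustrative assumptions.
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: qwen-distributed
spec:
  model:
    uri: hf://Qwen2.5        # Model Spec: model URL (LoRA adapters would attach here)
  router:
    gateway: {}              # Router Spec: Gateway API
    scheduler: {}            # Router Spec: inference scheduler
  prefill:
    replicas: 2              # Workload Spec: disaggregated prefill workers
  worker:
    replicas: 4              # Workload Spec: decode workers (multi-node)
  parallelism:
    tensor: 8                # Parallelism Spec: TP
    data: 2                  # Parallelism Spec: DP
```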
LLM Inference Data Plane
Intelligent Endpoint Picking
Roadmap
Unified Inference Fabric
Core Objectives: Unified Inference Fabric
KServe: Gateway API to Unify Inference Workload
Community
19 Maintainers and 300+ Contributors!
Trusted by industry leaders
KServe is used in production by organizations across various industries, providing reliable model inference at scale.
Join Our Community
https://github.com/kserve/kserve