1 of 31

KServe Next:

Advancing Generative AI Model Serving

Dan Sun, Engineering Team Lead, Bloomberg;

Co-Founder, KServe

Yuan Tang, Senior Principal Software Engineer, Red Hat;

Project Co-Lead, KServe

#KubeCon #CloudNativeCon

2 of 31

Model inference platform

Supported runtimes

Orchestration

Hardware accelerators (GPUs, CPUs, etc.)

Cloud native integrations

Autoscaling

Networking

Generative

Predictive

GenAI integrations

3 of 31

Our Journey

4 of 31

Our Journey

2019:

Developed collaboratively by Google, IBM, Bloomberg, NVIDIA, and Seldon under the Kubeflow project as KFServing.

5 of 31

Our Journey

2021:

Rebranded from KFServing to the standalone KServe project.

6 of 31

Our Journey

2022: KServe joined LF AI & Data Foundation.

7 of 31

Our Journey

Sept 2025: KServe was accepted as a CNCF incubating project!

8 of 31

What is KServe?

9 of 31

Assembling AI with KServe

KServe is a platform that unifies Generative and Predictive AI Inference on Kubernetes. It is simple enough for quick deployments, yet powerful enough to handle enterprise-scale AI workloads with advanced features.

10 of 31

KServe Predictive AI Inference

Diagram from Alexa Griffith

11 of 31

AI Inference Traffic 🚦

| Feature | Predictive AI Traffic | Generative AI Traffic |
| --- | --- | --- |
| Task | Classification, regression | Generate new data: text, image, code |
| Input traffic | Small, structured, high volume | Variable, text/multi-modal; can be bursty |
| Output traffic | Small: a single label or score | Large, streaming, high bandwidth |
| Latency | Ultra-low latency | Moderate to high latency |
| Computation / scaling | CPU, or GPU (e.g. L4) for transformer models; horizontal scaling to maximize QPS | Heavy computation on H100/H200 with high memory bandwidth; efficient GPU/VRAM utilization |

12 of 31

KServe GenAI: Advanced Building Blocks for LLMs

LLM Metric-Based Autoscaling

Prompt Caching

Intelligent Routing, Traffic Management

GenAI Runtime

vLLM, TRT-LLM, llm-d

Scale

Cost

Latency / Throughput

Efficiency

13 of 31

KServe GenAI Architecture

14 of 31

Highlighted Features

15 of 31

Why Concurrency-Based Autoscaling Fails for LLMs

LLMs run on expensive GPUs; an idle model is a massive waste of money

Default KServe autoscaler uses in-flight concurrency: a request counts as concurrent from the moment it is received until the last token is sent

Why concurrency fails: the bottleneck is GPU VRAM for the KV cache. A single long prompt can fill VRAM, but the Knative Pod Autoscaler (KPA) only sees concurrency = 1
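A back-of-the-envelope calculation makes the point concrete. This is a minimal sketch, not KServe code; the model shape is an assumed Llama-3-70B-like configuration (80 layers, 8 KV heads with GQA, head_dim 128, fp16 KV cache):

```python
# Illustrative arithmetic: why one long request can exhaust VRAM while
# request concurrency stays at 1. Model shape is an assumption, not a
# KServe default.

def kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V each store layers * kv_heads * head_dim elements per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token()        # 327,680 bytes ~= 320 KiB/token
long_prompt_tokens = 32_768
gib = per_token * long_prompt_tokens / 2**30  # 10.0 GiB for a single request

print(f"{per_token} bytes/token, {gib:.1f} GiB for one 32k-token request")
```

One 32k-token request ties up roughly 10 GiB of KV cache on this model shape, yet a concurrency-based autoscaler counts it the same as a 10-token request.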

16 of 31

KServe Metric-Based Autoscaler

GPU Utilization %

Measures compute load; a useful secondary signal for scaling decisions

KV Cache VRAM %

When VRAM is full, the model can’t accept more requests; we must scale before hitting 100%

predictor:
  model:
    modelFormat:
      name: huggingface
    storageUri: "hf://Qwen2.5"
  minReplicas: 1
  maxReplicas: 5
  autoScaling:
    metrics:
      - type: PodMetric
        podmetric:
          metric:
            query: "vllm:kv_cache_usage_percent"
          target:
            type: Value
            value: "75"

17 of 31

Envoy AI Gateway: Token Rate Limiting

Diagram from Envoy AI Gateway website

Rate Limit for LLMs

Token usage extraction from response body

The “cost” is stored in Envoy dynamic metadata
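The extraction step can be illustrated with a short Python sketch. This is a hypothetical helper, not Envoy AI Gateway code; it only shows the idea of pulling the token cost out of an OpenAI-style response body:

```python
# Hedged sketch: extract the token "cost" from an OpenAI-style chat
# completion response. In the gateway this value would then be written to
# Envoy dynamic metadata and fed into the rate limiter.
import json

def extract_token_cost(response_body: str) -> int:
    """Return total tokens consumed, or 0 if the body carries no usage."""
    body = json.loads(response_body)
    usage = body.get("usage") or {}
    return usage.get("total_tokens", 0)

resp = json.dumps({
    "choices": [{"message": {"role": "assistant", "content": "hi"}}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
})
print(extract_token_cost(resp))  # 15
```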

18 of 31

Envoy AI Gateway: Inference Extension

Traditional load balancing falls short for LLM inference workloads

Intelligent endpoint picking using KV cache usage, LoRA adapter placement, and LLM load information

Diagram by Erica from Envoy AI Gateway website

19 of 31

GenAI Feature: llm-d Integration

An open-source framework for distributed large language model (LLM) inference that runs natively on Kubernetes

https://github.com/llm-d/llm-d

20 of 31

GenAI Feature: llm-d Integration

Intelligent Inference Scheduler

Prefix Caching

P/D Disaggregated Serving

Variant Autoscaling

21 of 31

LLM Inference Workload

Diagram by Jooho Lee from KServe website

Single-Node/Multi-GPU:

SLM (less than 70B)

Multi-Node/Multi-GPU: Distributed Inference for large models

Disaggregated Prefill/Decode: High Throughput Requirement

22 of 31

LLM Inference Control Plane API

Diagram by Jooho Lee from KServe website

Model Spec: model URL, LoRA

Router Spec: Gateway API, scheduler

Workload Spec: single/multi-node, prefill/decode

Parallelism Spec: TP/DP/EP
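Put together, the four spec sections might look roughly like the hedged YAML sketch below. The field names and values are illustrative only, not the verbatim LLMInferenceService schema; consult the KServe docs for the actual API:

```yaml
# Hedged sketch of an LLM inference control-plane resource.
# Field names are illustrative of the four spec sections, not exact.
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-70b
spec:
  model:              # Model Spec: model URL, LoRA
    uri: hf://meta-llama/Llama-3.3-70B-Instruct
  router: {}          # Router Spec: Gateway API, scheduler (defaults)
  prefill:            # Workload Spec: prefill/decode disaggregation
    replicas: 1
  worker:             # Workload Spec: decode workers, single/multi-node
    replicas: 2
  parallelism:        # Parallelism Spec: TP/DP/EP
    tensor: 4
```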

23 of 31

LLM Inference Data Plane

Diagram by Jooho Lee from KServe website

Intelligent Endpoint Picking

  • Prefix Cache Routing
  • Prefill-Decode Pool Routing
  • Load-Aware Routing
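As a toy illustration of load-aware endpoint picking (the names, scores, and weights here are invented for the sketch; this is not the actual scheduler code):

```python
# Toy endpoint picker: prefer replicas with free KV-cache headroom, a
# shallow queue, and an existing prefix-cache hit. Weights are invented.
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    kv_cache_usage: float   # fraction of KV-cache VRAM in use, 0.0-1.0
    prefix_cache_hit: bool  # replica already holds the prompt prefix
    queued_requests: int

def score(ep: Endpoint) -> float:
    # Lower is better: penalize VRAM pressure and queue depth,
    # reward a prefix-cache hit (saves recomputing the prefill).
    s = ep.kv_cache_usage + 0.1 * ep.queued_requests
    if ep.prefix_cache_hit:
        s -= 0.5
    return s

def pick(endpoints):
    return min(endpoints, key=score)

pool = [
    Endpoint("pod-a", kv_cache_usage=0.90, prefix_cache_hit=False, queued_requests=4),
    Endpoint("pod-b", kv_cache_usage=0.60, prefix_cache_hit=True,  queued_requests=2),
    Endpoint("pod-c", kv_cache_usage=0.40, prefix_cache_hit=False, queued_requests=0),
]
print(pick(pool).name)  # pod-b: busier than pod-c, but the prefix hit wins
```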

24 of 31

Roadmap

25 of 31

Unified Inference Fabric

Core Objectives: Unified Inference Fabric

  • Centralized platform for serving open-source and fine-tuned models
  • Standardized APIs and telemetry for consistency
  • Fully managed inference experience with optimized performance and efficient compute utilization
  • Secure identity for agent-to-LLM traffic with SPIFFE/SPIRE

26 of 31

KServe: Gateway API to Unify Inference Workload

27 of 31

Community

28 of 31

19 Maintainers and 300+ Contributors!

29 of 31

Trusted by industry leaders

KServe is used in production by organizations across various industries, providing reliable model inference at scale.

30 of 31

Join Our Community

  • Repo: https://github.com/kserve/kserve
  • Website: https://kserve.github.io
  • Biweekly community meetings on Thursdays at 9 AM PST
  • #kserve and #kserve-contributors channels in the CNCF Slack


31 of 31