Evolving KServe: The Unified Model Inference Platform for Both Predictive and Generative AI
Yuan Tang
Senior Principal Software Engineer, Red Hat
Project Lead, KServe
#KubeCon #CloudNativeCon
Model inference platform
Supported runtimes
Orchestration
Hardware accelerators (GPUs, CPUs, etc.)
Cloud native integrations
Autoscaling
Networking
Generative
Predictive
GenAI integrations
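The platform capabilities above come together in KServe's InferenceService custom resource. A minimal sketch for a predictive model, assuming a scikit-learn model and an illustrative storage URI (the bucket path and names are placeholders, not from the slides):

```yaml
# Minimal predictive InferenceService (illustrative values).
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn            # KServe selects a matching serving runtime
      storageUri: gs://example-bucket/models/iris   # placeholder path
```

Applying this manifest lets KServe pick a supported runtime, provision the serving pod (on CPUs or GPUs), and wire up networking and autoscaling via its cloud native integrations.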
KServe
CNCF incubating project (Sept 2025)
30+ adopters
19 maintainers
300+ contributors
Trusted by industry leaders
KServe is used in production by organizations across industries to provide reliable model inference at scale.
GenAI Building Blocks
LLM Metric-Based Autoscaling
Prompt Caching
Intelligent Routing, Traffic Management
GenAI Runtime
vLLM, TRT-LLM, llm-d
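These building blocks are exercised through the same InferenceService API. A hedged sketch of a generative deployment using KServe's Hugging Face runtime (which can use vLLM as a backend); the model ID, name, and GPU count are illustrative assumptions:

```yaml
# Generative InferenceService sketch (illustrative values).
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-example
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface        # Hugging Face LLM runtime
      args:
        - --model_name=llm-example
        - --model_id=meta-llama/Meta-Llama-3-8B-Instruct  # assumed model
      resources:
        limits:
          nvidia.com/gpu: "1"    # assumed single-GPU deployment
```

On top of such a deployment, the platform layers LLM metric-based autoscaling, prompt caching, and intelligent routing to balance scale, cost, latency/throughput, and efficiency.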
Scale
Cost
Latency / Throughput
Efficiency
Optimized LLM Inference
A CNCF Sandbox project for distributed large language model inference that runs natively on Kubernetes.
Join Our Community!
https://github.com/kserve/kserve
Find Us This Week!
https://github.com/kserve/kserve