1 of 20

Gateway API Inference Extension v0.1 Review

January 7, 2025

2 of 20

Overview

  • Project Goals
  • Project Scope
  • Personas
  • API Structure and Naming
  • Review proposed v0.1 API Surface
    • InferencePool
    • InferenceModel
  • Timeline

3 of 20

Project Goals

  • Provide a standard set of APIs for routing to Inference Workloads in Kubernetes
  • Provide a common pattern for developing Inference-focused routing extensions
  • Provide a reference implementation of that pattern
  • Provide a specification for model server frameworks to be compatible with this project

4 of 20

Project Use Cases

  • Model-aware routing: Make routing decisions based on model names (in addition to all other attributes supported by Gateway).
  • Serving priority: Specify the serving priority of your models to prioritize more critical models over less critical ones.
  • Model rollouts: Incrementally roll out new model versions by defining traffic splits based on model names.
  • More than routing: Configure additional services such as AI Safety checks or Semantic Caching.
  • Tunable LB algorithms: Fine-tune parameters for an existing extension or build/fork your own.

5 of 20

Project Scope

  • Current
    • API
    • Extension pattern
      • Reference implementation of that pattern
    • Pattern for model server frameworks to be compatible
  • Future
    • Conformance tests for:
      • Controllers implementing the API
      • Extensions
      • Model Server frameworks
    • Standard API surface for related extensions such as AI Safety or Semantic Caching

6 of 20

Request Flow

(Diagram: a GET /completions request flows from the Gateway, via the Endpoint Selection Extension, to an InferencePool of Pods running a compatible Model Server Framework)

1. The gateway selects the InferencePool to route to based on standard Gateway API configuration
2. The gateway forwards the request and endpoint info to the extension
3. The extension gets metrics from compatible model server frameworks
4. The extension tells the Gateway which endpoint to route to
5. The gateway sends the request to the endpoint selected by the endpoint picker
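To make step 1 concrete, here is a minimal sketch of the standard Gateway API configuration that routes traffic to an InferencePool. The route and gateway names are hypothetical, and the backendRef group/kind spelling is an assumption based on the v1alpha1 API group used later in this deck.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: gemma-route             # hypothetical route name
spec:
  parentRefs:
  - name: inference-gateway     # hypothetical Gateway name
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.x-k8s.io   # assumed group for the InferencePool backendRef
      kind: InferencePool
      name: gemma-pool            # the InferencePool from slide 9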

7 of 20

Personas

Inference Platform Admin

  • Creates and manages the infrastructure necessary to run LLM workloads, including:
    • Hardware & Resource Allocation
    • Model Server
    • Base Model
  • Rolling out updates to any of the above

Inference Workload Owner

  • Owns one or many Generative AI workloads, manages:
    • Objectives
    • Fine-tuning
    • LoRA Adapters
    • System Prompts
    • Prompt Cache
  • Rolling out updates to any of the above

8 of 20

Inference Platform Owner: Scope of API

Now

  • A way to select a set of Pods that:
    • Run the same LLM framework (e.g., vLLM)
    • Share the same base model
    • Are capable of running the same set of adapters
  • Select the port that should be targeted on those Pods
  • Primary route and policy target
  • Configure the endpoint selection extension

Future

  • GPU/TPU sharing
  • Default objectives
  • Configuration of which metrics to collect
  • Define different LB algorithms/metrics to use and different knobs for those LB algorithms
  • Define different thresholds for probing or synchronous vs. asynchronous probing

9 of 20

InferencePool (Owned by Inference Platform Owner)

  • Select the port that should be targeted on those Pods
    • Starting with target port number exclusively
    • May add name in the future
  • Selector:
    • Same simple selector that Service uses
  • ExtensionRef:
    • More on this later

apiVersion: inference.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: gemma-pool
spec:
  targetPortNumber: 443
  selector:
    app: vllm-gemma-1-5-pro
  extensionRef:
    name: endpoint-picker

10 of 20

InferencePool: Target Port Options

  • A) Struct (targetPort: {name, number})
  • B) IntOrString (targetPort: 80|"http")
  • C) Separate fields (targetPortNumber, targetPortName)
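An illustrative sketch of how each option might be spelled in the InferencePool spec; the port value (reusing 443 from the earlier example) and the exact field names are placeholders, not decisions:

# Option A: struct
spec:
  targetPort:
    number: 443
    name: http
---
# Option B: IntOrString
spec:
  targetPort: 443       # or "http"
---
# Option C: separate fields (v0.1 starts with targetPortNumber only)
spec:
  targetPortNumber: 443
  targetPortName: http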

11 of 20

InferencePool: Why Not Service?

  • Avoids pressure to add new fields and capabilities to the already overloaded Kubernetes Service API
  • Avoids creating another way to access these endpoints (a ClusterIP Service would bypass the Inference Gateway; InferencePool will not have the same ClusterIP routing concept)
  • Sets a clear expectation that creating this resource will be accompanied with some more advanced endpoint selection logic

12 of 20

(Diagram: Service vs. InferencePool feature comparison)

  • Shared by both: Pod Selector, Port Config
  • Service only: ClusterIP, NodePort, Multiple Ports, Protocols, Load Balancer, Session Affinity, DNS
  • InferencePool only: Extension Config, Inference Config
13 of 20

Inference Workload Owner: Scope of API

Now

  • Define one to many “workloads” that include:
    • Criticality
    • Model Name Mapping
      • Input: Foo
      • Output:
        • 50% foo-v1
        • 50% foo-v2

Future

  • Additional objectives
  • Preferred LLM framework or hardware
  • Cost criteria
    • Latency vs. throughput

14 of 20

InferenceModel (Owned by Inference Workload Owner)

  • Model Name:
    • Max Length 253
    • All chars allowed
  • Criticality values:
    • Critical
    • Default
    • Sheddable
  • Target Models:
    • Allows "rewriting" the model name before it reaches the backend
    • Can also be used to split traffic between different versions of a model

apiVersion: inference.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: tweet-summary
spec:
  modelName: tweet-summary
  criticality: Critical
  poolRef:
    name: gemma-pool
  targetModels:
  - name: tweet-summary-0
    weight: 50
  - name: tweet-summary-1
    weight: 50

15 of 20

InferencePool: Extensions and Algorithms

  • Each InferencePool will be accompanied by some form of extension or algorithm that is responsible for implementing the objectives defined by the InferenceModels
  • What needs to be configurable?
    • Endpoint picker configuration
      • Reference to extension Service
        • Likely necessary to enable users to bring their own extension; this may become optional at some point
      • Algorithm name built into implementation
        • Some implementations may be able to support algorithms natively
    • Extension connection configuration (fail open or closed; more in the future)
      • Can’t fail open when an InferenceModel attached to the pool is modifying the request body

16 of 20

InferencePool: The Straightforward Bits

  • Failure Mode: failureMode: FailOpen|FailClose
  • Algorithm:
    • A) Struct (algorithm: {name})
    • B) String (algorithmName)
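A rough sketch of what these fields could look like on the InferencePool spec; the field placement and the algorithm name are illustrative assumptions, not part of the reviewed API.

# Failure mode
spec:
  extensionRef:
    name: endpoint-picker
    failureMode: FailClose        # or FailOpen
---
# Algorithm, option A: struct
spec:
  algorithm:
    name: queue-depth-aware       # hypothetical algorithm name
---
# Algorithm, option B: plain string
spec:
  algorithmName: queue-depth-aware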

17 of 20

InferencePool: Extension Ref Options

  • A) Service Ref (extensionRef: {name, kind, group, port})
  • B) Deployment Ref (extensionRef: {name, kind, group, port})
  • C) Pod Selector + Port (extension: {selector, port})
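Illustrative shapes for the three options; the group/kind values and port 9002 are assumptions for the sketch, reusing the endpoint-picker name from slide 9:

# Option A: Service reference
spec:
  extensionRef:
    group: ""            # core API group
    kind: Service
    name: endpoint-picker
    port: 9002
---
# Option B: Deployment reference
spec:
  extensionRef:
    group: apps
    kind: Deployment
    name: endpoint-picker
    port: 9002
---
# Option C: Pod selector + port
spec:
  extension:
    selector:
      app: endpoint-picker
    port: 9002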

18 of 20

API Structure

(Diagram: role-oriented resource model)

  • 👨🏽‍🔧👩🏼‍🔧 Cluster Operators: Gateway
  • 👨🏾‍💼👩🏻‍💻 Application Developers: HTTPRoute, Service
  • 👷🏾‍♀️👷🏻‍♂️ Inference Platform Owners: InferencePool
  • 🧑🏼‍⚕️🧑🏿‍💻 Inference Workload Owners: InferenceModel(s)

HTTPRoutes attach to a Gateway and can forward to a Service or an InferencePool; one or more InferenceModels reference an InferencePool.

19 of 20

Timeline

  • Aiming for RC1 this week
  • v0.1 next week
  • Implementations ASAP
    • Envoy Gateway
    • Gloo k8sGateway
    • GKE Gateway
  • v0.2 around KubeCon London
  • v1.0 this Summer

20 of 20

Resources

  • v0.1 API Review: #154
  • Configurable Extensions API Review: #162