1 of 20

Gateway API Inference Extension v0.1 Review

January 7, 2025

2 of 20

Overview

  • Project Goals
  • Project Scope
  • Personas
  • API Structure and Naming
  • Review proposed v0.1 API Surface
    • InferencePool
    • InferenceModel
  • Timeline

3 of 20

Project Goals

  • Provide a standard set of APIs for routing to Inference Workloads in Kubernetes
  • Provide a common pattern for developing Inference-focused routing extensions
  • Provide a reference implementation of that pattern
  • Provide a specification for model server frameworks to be compatible with this project

4 of 20

Project Use Cases

  • Model-aware routing: Make routing decisions based on model names (in addition to all other attributes supported by Gateway).
  • Serving priority: Specify the serving priority of your models to prioritize more critical models over less critical ones.
  • Model rollouts: Incrementally roll out new model versions by defining traffic splits based on model names.
  • More than routing: Configure additional services such as AI Safety checks or Semantic Caching.
  • Tunable LB algorithms: Fine-tune parameters for an existing extension or build/fork your own.

5 of 20

Project Scope

  • Current
    • API
    • Extension pattern
      • Reference implementation of that pattern
    • Pattern for model server frameworks to be compatible
  • Future
    • Conformance tests for:
      • Controllers implementing the API
      • Extensions
      • Model Server frameworks
    • Standard API surface for related extensions such as AI Safety or Semantic Caching

6 of 20

Request Flow

(Diagram: a GET /completions request flows from the Gateway, via the Endpoint Selection Extension, to an InferencePool of Pods running a compatible Model Server Framework)

1. The gateway selects the InferencePool to route to based on standard Gateway API configuration
2. The gateway forwards the request and endpoint info to the extension
3. The extension gets metrics from compatible model server frameworks
4. The extension tells the Gateway which endpoint to route to
5. The gateway sends the request to the endpoint selected by the endpoint picker
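To make step 1 concrete, here is a minimal sketch of the standard Gateway API configuration that routes traffic to an InferencePool. The route and gateway names are hypothetical, and the backendRef group/kind spelling is an assumption based on the v1alpha1 API group used later in this deck.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: gemma-route             # hypothetical route name
spec:
  parentRefs:
  - name: inference-gateway     # hypothetical Gateway name
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.x-k8s.io   # assumed group for the InferencePool backendRef
      kind: InferencePool
      name: gemma-pool            # the InferencePool from slide 9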

7 of 20

Personas

Inference Platform Admin

  • Creates and manages the infrastructure necessary to run LLM workloads, including:
    • Hardware & Resource Allocation
    • Model Server
    • Base Model
  • Rolling out updates to any of the above

Inference Workload Owner

  • Owns one or many Generative AI workloads, manages:
    • Objectives
    • Fine-tuning
    • LoRA Adapters
    • System Prompts
    • Prompt Cache
  • Rolling out updates to any of the above

8 of 20

Inference Platform Owner: Scope of API

Now

  • A way to select a set of Pods that:
    • Run the same LLM framework (e.g., vLLM)
    • Share the same base model
    • Are capable of running the same set of adapters
  • Select the port that should be targeted on those Pods
  • Primary route and policy target
  • Configure the endpoint selection extension

Future

  • GPU/TPU sharing
  • Default objectives
  • Configuration of which metrics to collect
  • Define different LB algorithms/metrics to use and different knobs for those LB algorithms
  • Define different thresholds for probing or synchronous vs. asynchronous probing

9 of 20

InferencePool (Owned by Inference Platform Owner)

  • Select the port that should be targeted on those Pods
    • Starting with target port number exclusively
    • May add name in the future
  • Selector:
    • Same simple selector that Service uses
  • ExtensionRef:
    • More on this later

apiVersion: inference.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: gemma-pool
spec:
  targetPortNumber: 443
  selector:
    app: vllm-gemma-1-5-pro
  extensionRef:
    name: endpoint-picker

10 of 20

InferencePool: Target Port Options

  • A) Struct (targetPort: {name, number})
  • B) IntOrString (targetPort: 80|"http")
  • C) Separate fields (targetPortNumber, targetPortName)
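An illustrative sketch of how each option might be spelled in the InferencePool spec; the port value (reusing 443 from the earlier example) and the exact field names are placeholders, not decisions:

# Option A: struct
spec:
  targetPort:
    number: 443
    name: http
---
# Option B: IntOrString
spec:
  targetPort: 443       # or "http"
---
# Option C: separate fields (v0.1 starts with targetPortNumber only)
spec:
  targetPortNumber: 443
  targetPortName: http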

11 of 20

InferencePool: Why Not Service?

  • Avoids pressure to add new fields and capabilities to the already overloaded Kubernetes Service API
  • Avoids creating another way to access these endpoints (a ClusterIP Service would bypass the Inference Gateway; InferencePool will not have the same ClusterIP routing concept)
  • Sets a clear expectation that creating this resource will be accompanied with some more advanced endpoint selection logic

12 of 20

(Diagram: Service vs. InferencePool feature comparison)

  • Shared by both: Pod Selector, Port Config
  • Service only: ClusterIP, NodePort, Multiple Ports, Protocols, Load Balancer, Session Affinity, DNS
  • InferencePool only: Extension Config, Inference Config
13 of 20

Inference Workload Owner: Scope of API

Now

  • Define one to many “workloads” that include:
    • Criticality
    • Model Name Mapping
      • Input: Foo
      • Output:
        • 50% foo-v1
        • 50% foo-v2

Future

  • Additional objectives
  • Preferred LLM framework or hardware
  • Cost criteria
    • Latency vs. throughput

14 of 20

InferenceModel (Owned by Inference Workload Owner)

  • Model Name:
    • Max Length 253
    • All chars allowed
  • Criticality values:
    • Critical
    • Default
    • Sheddable
  • Target Models:
    • Allows "rewriting" the model name before it reaches the backend
    • Can also be used to split traffic between different versions of a model

apiVersion: inference.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: tweet-summary
spec:
  modelName: tweet-summary
  criticality: Critical
  poolRef:
    name: gemma-pool
  targetModels:
  - name: tweet-summary-0
    weight: 50
  - name: tweet-summary-1
    weight: 50

15 of 20

InferencePool: Extensions and Algorithms

  • Each InferencePool will be accompanied by some form of extension or algorithm that is responsible for implementing the objectives defined by the InferenceModels
  • What needs to be configurable?
    • Endpoint picker configuration
      • Reference to extension Service
        • Likely necessary to enable users to bring their own extension; this may become optional at some point
      • Algorithm name built into implementation
        • Some implementations may be able to support algorithms natively
    • Extension connection configuration (fail open or closed; more in the future)
      • Can’t fail open when an InferenceModel attached to the pool is modifying the request body

16 of 20

InferencePool: The Straightforward Bits

  • Failure Mode: failureMode: FailOpen|FailClose
  • Algorithm:
    • A) Struct (algorithm: {name})
    • B) String (algorithmName)
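A rough sketch of what these fields could look like on the InferencePool spec; the field placement and the algorithm name are illustrative assumptions, not part of the reviewed API.

# Failure mode
spec:
  extensionRef:
    name: endpoint-picker
    failureMode: FailClose        # or FailOpen
---
# Algorithm, option A: struct
spec:
  algorithm:
    name: queue-depth-aware       # hypothetical algorithm name
---
# Algorithm, option B: plain string
spec:
  algorithmName: queue-depth-aware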

17 of 20

InferencePool: Extension Ref Options

  • A) Service Ref (extensionRef: {name, kind, group, port})
  • B) Deployment Ref (extensionRef: {name, kind, group, port})
  • C) Pod Selector + Port (extension: {selector, port})
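Illustrative shapes for the three options; the group/kind values and port 9002 are assumptions for the sketch, reusing the endpoint-picker name from slide 9:

# Option A: Service reference
spec:
  extensionRef:
    group: ""            # core API group
    kind: Service
    name: endpoint-picker
    port: 9002
---
# Option B: Deployment reference
spec:
  extensionRef:
    group: apps
    kind: Deployment
    name: endpoint-picker
    port: 9002
---
# Option C: Pod selector + port
spec:
  extension:
    selector:
      app: endpoint-picker
    port: 9002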

18 of 20

API Structure

(Diagram: role-oriented resource model)

  • 👨🏽‍🔧👩🏼‍🔧 Cluster Operators: Gateway
  • 👨🏾‍💼👩🏻‍💻 Application Developers: HTTPRoute, Service
  • 👷🏾‍♀️👷🏻‍♂️ Inference Platform Owners: InferencePool
  • 🧑🏼‍⚕️🧑🏿‍💻 Inference Workload Owners: InferenceModel(s)

HTTPRoutes attach to a Gateway and can forward to a Service or an InferencePool; one or more InferenceModels reference an InferencePool.

19 of 20

Timeline

  • Aiming for RC1 this week
  • v0.1 next week
  • Implementations ASAP
    • Envoy Gateway
    • Gloo k8sGateway
    • GKE Gateway
  • v0.2 around KubeCon London
  • v1.0 this Summer

20 of 20

Resources

  • v0.1 API Review: #154
  • Configurable Extensions API Review: #162