Yuan Tang @TerryTangYuan

Principal Software Engineer, Red Hat OpenShift AI
Project Lead, Argo & Kubeflow

Production-Ready AI Platform on Kubernetes

Agenda

  • AI Landscape & Ecosystem
  • Elements of Production Readiness
    • Scalability
    • Reliability
    • Observability
    • Flexibility
  • Cloud Native Production-ready AI Platform
    • Data Processing
    • Model Training
    • Model Tuning
    • Model Serving
    • Workflow

Distributed Machine Learning Patterns

AI Landscape & Ecosystem

[Landscape diagrams by the CNCF Cloud Native AI Working Group]

[Figure: the AI landscape as a spectrum, from how much infrastructure is needed to how much the data scientist cares about it]

Production Readiness - Scalability

  • Horizontal scaling - more pods
    • Kubernetes Horizontal Pod Autoscaler (see the sketch after this list)
    • Knative pod autoscaler: request-driven, including scale-to-zero
  • Vertical scaling - more resources for existing pods
    • Kubernetes Vertical Pod Autoscaler
    • Resizer (addon-resizer): adjusts resources based on the number of nodes in the cluster
  • Cluster autoscaler - automatically adjusts the size of a Kubernetes cluster
  • Algorithm scalability
  • Hardware acceleration and resource sharing
  • Batch scheduling
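
A minimal sketch of horizontal scaling with the Kubernetes Horizontal Pod Autoscaler; the target Deployment and thresholds are hypothetical:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server            # hypothetical Deployment serving a model
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80  # add pods when average CPU utilization exceeds 80%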

Production Readiness - Reliability

  • High availability and disaster recovery
    • K8s controllers: leader election
  • Elasticity and fault tolerance
  • Versioning: GitOps
  • Avoiding vendor lock-in / hybrid cloud
  • Support/SLAs

Production Readiness - Observability

  • Performance metrics
    • Statistical (hyperparameter tuning, experiment tracking)
    • Operational (system, resources)
  • Explainability & visualization
  • Pipeline tracing
  • Audit logs

Production Readiness - Flexibility

  • Various ML frameworks
  • Language-specific SDKs
  • Standardized APIs
  • Data: size, streaming/batching
  • Model: size, framework, performance
  • Integration with various hardware accelerators
  • Cloud/on-prem/edge
  • Avoiding vendor lock-in

Kubeflow: The ML Toolkit for Kubernetes

Cloud Native Production-ready AI Platform

  1. Data Processing
  • Big data - Apache Spark
    • Batch & streaming
    • Time series
  • Welcome kubeflow/spark-operator to the Kubeflow project! (a minimal SparkApplication sketch follows)
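
A minimal sketch of a SparkApplication handled by the Spark operator; the image, main file, and sizing are hypothetical:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: etl-job                     # hypothetical name
spec:
  type: Python
  mode: cluster
  image: spark:3.5.0                # hypothetical image
  mainApplicationFile: local:///opt/app/etl.py   # hypothetical script
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 1g
    serviceAccount: spark
  executor:
    instances: 2                    # two executors for the batch job
    cores: 2
    memory: 2g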

  1. Data Processing (continued)
  • Fluid (fluid-cloudnative/fluid) - see the sketch after this list
    • Enables dataset warmup and acceleration for data-intensive applications using distributed caching on Kubernetes
    • Dataset abstractions for managing heterogeneous data sources
    • Data-aware scheduling
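
A minimal sketch of a Fluid Dataset backed by an Alluxio cache runtime; the bucket path and cache quota are hypothetical:

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: training-data
spec:
  mounts:
    - mountPoint: s3://example-bucket/training-data   # hypothetical source
      name: training-data
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: training-data               # must match the Dataset name
spec:
  replicas: 2                       # number of cache workers
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi                  # per-worker cache size
        high: "0.95"
        low: "0.7"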

2. Model Training

[Figure: distributed all-reduce model training with multiple workers and data partitions - each worker consumes its own data partition. Source: the Distributed Machine Learning Patterns book]

Kubeflow Training Operator Architecture

Distributed training with TensorFlow
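
A minimal TFJob sketch for the Kubeflow Training Operator; the image and training script are hypothetical:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: distributed-training       # hypothetical name
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 3                  # three workers, mirroring the all-reduce figure
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow     # the Training Operator expects this container name
              image: example.com/tf-train:latest   # hypothetical image
              command: ["python", "/app/train.py"]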

Distributed large model fine-tuning

For details, see the talk on this topic by Andrey Velichkevich; a minimal PyTorchJob sketch follows.
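
A hedged sketch of multi-node fine-tuning as a Kubeflow PyTorchJob; the model, image, and resources are hypothetical:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-fine-tune              # hypothetical name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch        # the Training Operator expects this container name
              image: example.com/fine-tune:latest   # hypothetical image
              command: ["torchrun", "--nnodes=4", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 3                  # three workers plus the master: four nodes total
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: example.com/fine-tune:latest   # hypothetical image
              command: ["torchrun", "--nnodes=4", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1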

3. Model Tuning

Katib: Kubernetes-native AutoML in Kubeflow

  • Supports hyperparameter (HP) tuning, neural architecture search (NAS), and early stopping
  • Agnostic to ML frameworks and programming languages
  • Can be deployed on local machines or on private/public clouds
  • Can orchestrate any Kubernetes workloads and custom resources
  • Natively integrated with Kubeflow components (Notebooks, Pipelines, Training Operator)

Katib Architecture

Reference paper: https://arxiv.org/abs/2006.02085

Example

A Katib Experiment spec combines an experiment budget, a search space, an algorithm, an objective, and a trial template.
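
A hedged reconstruction of such an Experiment, following the canonical random-search layout; the image and hyperparameter range are hypothetical:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search-example      # hypothetical name
spec:
  # Experiment budget
  maxTrialCount: 12
  parallelTrialCount: 3
  maxFailedTrialCount: 3
  # Objective
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  # Algorithm
  algorithm:
    algorithmName: random
  # Search space
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
  # Trial template
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate for the training job
        reference: lr
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: training-container
                image: example.com/train:latest   # hypothetical image
                command: ["python", "train.py", "--lr=${trialParameters.learningRate}"]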

4. Model Serving

KServe: a highly scalable, standards-based, cloud-agnostic model inference platform on Kubernetes

  • Performant, standardized inference protocol across ML frameworks.
  • Serverless inference workloads with request-based autoscaling, including scale-to-zero on CPU and GPU.
  • High scalability, density packing, and intelligent routing using ModelMesh.
  • Simple and pluggable production serving for inference, pre/post-processing, monitoring, and explainability.
  • Advanced deployments: canary rollouts, pipelines, and ensembles with InferenceGraph.

Single model serving
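
A minimal single-model InferenceService sketch; the storage URI follows the sklearn example from the KServe docs, and the name is hypothetical:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris               # hypothetical name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model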

Multi-model serving: ModelMesh

  • Designed for high-scale, high-density, and frequently-changing model use cases.
  • Intelligently loads and unloads models to and from memory to strike a trade-off between responsiveness to users and computational footprint. (see the sketch after this list)
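
Assuming ModelMesh Serving is installed, an InferenceService can opt into mesh scheduling via a deployment-mode annotation; a sketch, with a hypothetical model location:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mnist-svm                  # hypothetical name
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://example-models/sklearn/mnist-svm.joblib   # hypothetical location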

LLMs

> curl -H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
    -v http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/${MODEL_NAME}/infer \
    -d '{"id": "42", "inputs": [{"name": "input0", "shape": [-1], "datatype": "BYTES", "data": ["Where is Eiffel Tower?"]}]}'

{"text_output":"The Eiffel Tower is located in the 7th arrondissement of Paris, France. It stands on the Champ de Mars, a large public park next to the Seine River. The tower's exact address is:\n\n2 Rue du Champ de Mars, 75007 Paris, France.","model_name":"llama2"}

Problem: model initialization takes a long time

Solution: the Modelcars feature in KServe (the model is packaged as an OCI image) brings:

  • Reduced startup times: repetitive downloads of large models are avoided, significantly minimizing startup delays.
  • Lower disk space usage: less duplicated local storage is needed, conserving disk space.
  • Enhanced performance: allows advanced techniques like image pre-fetching and lazy loading, improving efficiency.

A hedged InferenceService sketch follows.
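
Assuming Modelcars is enabled in the KServe configuration, a model baked into an OCI image can be referenced with an oci:// storage URI; the registry and model format here are hypothetical:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama2-modelcar            # hypothetical name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface          # hypothetical format/runtime
      storageUri: oci://registry.example.com/models/llama2:latest   # model packaged as an OCI image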

5. Workflow

Argo Workflows

The container-native workflow engine for Kubernetes

  • Machine learning pipelines
  • Data processing/ETL
  • Infrastructure automation
  • Continuous delivery/integration

CRDs and Controllers

  • Kubernetes custom resources that natively integrate with other K8s resources (volumes, secrets, etc.)

Interfaces

  • CLI: manage workflows and perform operations (submit, suspend, delete, etc.)
  • Server: REST & gRPC interfaces
  • SDKs: Python, Go, and Java
  • UI: manage and visualize workflows, artifacts, logs, resource usage analytics, etc.

Example
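
A hedged sketch of an ML pipeline expressed as an Argo Workflow DAG; the images and commands are hypothetical:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-       # hypothetical name prefix
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: ingest
            template: ingest
          - name: train
            template: train
            dependencies: [ingest] # train runs only after ingest completes
    - name: ingest
      container:
        image: example.com/etl:latest     # hypothetical image
        command: [python, ingest.py]
    - name: train
      container:
        image: example.com/train:latest   # hypothetical image
        command: [python, train.py]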

Argo Events

Event-driven workflow automation

  • Supports events from 20+ event sources
    • Webhooks, S3, GCP PubSub, Git, Slack, etc.
  • Supports 10+ triggers
    • Kubernetes Objects, Argo Workflow, AWS Lambda, Kafka, Slack, etc.
  • Manages everything from simple, linear, real-time events to complex, multi-source events
  • CloudEvents specification compliant

[Figure: Argo Events receives GitHub events (commits/PRs/tags/etc.) and triggers an ML pipeline with Argo Workflows. A cache store (Argo/K8s/etc.) records whether the data has already been updated recently: if it has NOT, the pipeline runs data ingestion before model training; if it has, ingestion is skipped.]
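
A hedged sketch of the trigger wiring, using a generic webhook EventSource in place of the GitHub one; the names and the referenced WorkflowTemplate are hypothetical:

apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: repo-webhook               # hypothetical name
spec:
  webhook:
    push:                          # hypothetical event name
      port: "12000"
      endpoint: /push
      method: POST
---
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: retrain-on-push            # hypothetical name
spec:
  dependencies:
    - name: push-dep
      eventSourceName: repo-webhook
      eventName: push
  triggers:
    - template:
        name: run-ml-pipeline
        argoWorkflow:
          operation: submit        # submit a new Workflow when the event fires
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: ml-pipeline-
              spec:
                workflowTemplateRef:
                  name: ml-pipeline    # hypothetical WorkflowTemplate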

6. Iterations

Distributed Machine Learning Patterns