1 of 25

Yuan Tang @TerryTangYuan

Principal Software Engineer, Red Hat OpenShift AI · Project Lead, Argo & Kubeflow

Engineering Cloud Native AI Platform

2 of 25

AI Landscape & Ecosystem

3 of 25

AI Landscape & Ecosystem

4 of 25

Production Readiness - Scalability

  • Horizontal scaling - more pods
    • K8s horizontal pod autoscaler
    • Knative pod autoscaler: event-driven
  • Vertical scaling - more resources for existing pods
    • K8s vertical pod autoscaler
    • Addon Resizer: adjusts a pod's resources based on the number of nodes in the cluster
  • Cluster autoscaler - automatically adjusts the size of a Kubernetes cluster
  • Algorithm scalability
  • Hardware acceleration, resource sharing, scheduling
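As a concrete sketch of the first bullet, a Kubernetes HorizontalPodAutoscaler can scale an inference Deployment on CPU utilization (the `model-server` Deployment name is a hypothetical placeholder):

```yaml
# Minimal sketch: keep between 2 and 10 replicas of a hypothetical
# "model-server" Deployment, targeting 80% average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```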

5 of 25

Production Readiness - Reliability

  • High availability and disaster recovery
    • K8s controller: leader election
  • Elasticity and fault tolerance
  • Versioning: GitOps
  • Vendor lock-in/hybrid cloud
  • Support/SLAs
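The GitOps bullet can be made concrete with, for example, an Argo CD Application that keeps the cluster in sync with a versioned Git repository (a sketch; the repo URL, path, and namespaces are placeholders):

```yaml
# Sketch: declaratively sync cluster state from Git (GitOps).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-platform
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/ml-platform-config  # placeholder repo
    targetRevision: main
    path: manifests/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-platform
  syncPolicy:
    automated:
      prune: true     # remove resources deleted from Git
      selfHeal: true  # revert manual drift back to the Git state
```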

6 of 25

Production Readiness - Observability

  • Performance metrics
    • Statistical (HP tuning, experiment tracking)
    • Operational (system, resources)
  • Explainability & visualization
  • Pipeline tracing & audit log

7 of 25

Production Readiness - Flexibility

  • Various ML frameworks
  • Language-specific SDKs
  • Standardized APIs
  • Data: size, streaming/batching
  • Model: size, framework, performance
  • Integration with various hardware accelerators
  • Cloud/on-prem/edge
  • Vendor lock-in

8 of 25

Kubeflow: The ML Toolkit for Kubernetes

9 of 25

Cloud Native Production-ready AI Platform

  1. Data Processing
  • Big data - Apache Spark
    • Batch & Streaming
    • Time series
  • Welcome kubeflow/spark-operator to Kubeflow project!
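With the Spark operator, a batch job is declared as a SparkApplication custom resource (a minimal sketch; the image and application jar path are placeholders):

```yaml
# Sketch: run the SparkPi example as a cluster-mode SparkApplication.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  type: Scala
  mode: cluster
  image: my-registry/spark:3.5.0            # placeholder image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/spark-examples.jar  # placeholder path
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 512m
  executor:
    instances: 2
    cores: 1
    memory: 512m
```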

10 of 25

Cloud Native Production-ready AI Platform

2. Model Training

Kubeflow Training Operator Architecture

11 of 25

Cloud Native Production-ready AI Platform

2. Model Training

Distributed training with TensorFlow
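With the training operator, distributed TensorFlow is expressed as a TFJob whose replica specs map to the TF cluster roles (a sketch; the training image is a hypothetical placeholder, and a `PS` replica spec could be added the same way for parameter-server training):

```yaml
# Sketch: 1 chief + 2 workers; the operator wires up TF_CONFIG for each pod.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: dist-mnist
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow   # TFJob expects this container name
              image: my-registry/mnist-train:latest  # placeholder image
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: my-registry/mnist-train:latest  # placeholder image
```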

12 of 25

Cloud Native Production-ready AI Platform

3. Model Tuning

Katib: Kubernetes-native AutoML in Kubeflow

  • Supports hyperparameter (HP) tuning, neural architecture search (NAS), and early stopping
  • Agnostic to ML frameworks and programming languages
  • Can be deployed on local machines or on private/public clouds
  • Can orchestrate any Kubernetes workloads and custom resources
  • Natively integrated with Kubeflow components (Notebooks, Pipelines, Training Operators)
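A Katib tuning run is declared as an Experiment: an objective, a search algorithm, a parameter space, and a trial template that the controller instantiates per trial (a sketch; the training image and script are hypothetical placeholders):

```yaml
# Sketch: random search over learning rate, maximizing reported accuracy.
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search-example
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  maxTrialCount: 12
  parallelTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
  trialTemplate:
    primaryContainerName: training
    trialParameters:
      - name: learningRate
        reference: lr
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training
                image: my-registry/train:latest  # placeholder image
                command: ["python", "train.py", "--lr=${trialParameters.learningRate}"]
            restartPolicy: Never
```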

13 of 25

Cloud Native Production-ready AI Platform

3. Model Tuning

14 of 25

Cloud Native Production-ready AI Platform

4. Model Serving

KServe: Highly scalable, standard, cloud-agnostic model inference platform on Kubernetes

  • Performant, standardized inference protocol across ML frameworks.
  • Serverless inference workloads with request-based autoscaling, including scale-to-zero on CPU and GPU.
  • High scalability, density packing and intelligent routing using ModelMesh.
  • Simple and pluggable production serving for inference, pre/post processing, monitoring and explainability.
  • Advanced deployments for canary rollout, pipeline, ensembles with InferenceGraph.
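A minimal KServe deployment is a single InferenceService (a sketch; the storage URI points at KServe's public sklearn example model):

```yaml
# Sketch: serve a scikit-learn model; minReplicas: 0 enables scale-to-zero.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 0
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```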

15 of 25

Cloud Native Production-ready AI Platform

4. Model Serving

Single model serving

16 of 25

Cloud Native Production-ready AI Platform

4. Model Serving

Multi model serving: ModelMesh

  • Designed for high-scale, high-density and frequently-changing model use cases.
  • Intelligently loads and unloads models to and from memory to strike a trade-off between responsiveness to users and computational footprint.
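Routing an InferenceService through ModelMesh instead of the default serverless path is done with a deployment-mode annotation (a sketch; model name and storage URI are placeholders):

```yaml
# Sketch: same InferenceService API, but served by ModelMesh for density packing.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mnist-mm
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://my-bucket/models/mnist  # placeholder URI
```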

17 of 25

Cloud Native Production-ready AI Platform

4. Model Serving

Inference Graph
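An InferenceGraph chains multiple InferenceServices into one endpoint (a sketch; the two service names are hypothetical, and the `Sequence` router pipes each step's output to the next):

```yaml
# Sketch: preprocessing followed by classification as a sequential graph.
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: example-graph
spec:
  nodes:
    root:
      routerType: Sequence
      steps:
        - serviceName: preprocess-svc   # hypothetical InferenceService
        - serviceName: classifier-svc   # hypothetical InferenceService
          data: $response               # feed previous step's response forward
```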

18 of 25

Cloud Native Production-ready AI Platform

4. Model Serving

Canary Rollout
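A canary rollout is expressed directly on the predictor: updating the spec with `canaryTrafficPercent` splits traffic between the previously promoted revision and the new one (a sketch; the v2 storage URI is a placeholder):

```yaml
# Sketch: send 10% of traffic to the new model revision, 90% to the last good one.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://my-bucket/models/iris/v2  # placeholder: new model version
```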

19 of 25

Cloud Native Production-ready AI Platform

4. Model Serving

Pluggable explainer runtimes

“Why did my model produce this inference result?”

Explainability
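As one sketch of a pluggable explainer runtime, KServe's v1beta1 API allows attaching an Alibi explainer alongside the predictor, so the same endpoint can answer "why" queries (model and explainer storage URIs are placeholders):

```yaml
# Sketch: predictor plus an Alibi AnchorTabular explainer on one InferenceService.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: income-classifier
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://my-bucket/models/income      # placeholder URI
  explainer:
    alibi:
      type: AnchorTabular
      storageUri: gs://my-bucket/explainers/income  # placeholder URI
```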

20 of 25

Cloud Native Production-ready AI Platform

5. Workflow

Argo Workflows

The container-native workflow engine for Kubernetes

  • Machine learning pipelines
  • Data processing/ETL
  • Infrastructure automation
  • Continuous delivery/integration
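An Argo Workflow is itself a Kubernetes resource: each step runs as a container (a minimal sketch in the spirit of the classic hello-world example):

```yaml
# Sketch: a single-step Workflow whose entrypoint runs one container.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: hello
  templates:
    - name: hello
      container:
        image: busybox
        command: [echo]
        args: ["hello world"]
```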

21 of 25

Cloud Native Production-ready AI Platform

5. Workflow

Argo Events

Event-driven workflow automation

  • Supports events from 20+ event sources
    • Webhooks, S3, GCP PubSub, Git, Slack, etc.
  • Supports 10+ triggers
    • Kubernetes Objects, Argo Workflow, AWS Lambda, Kafka, Slack, etc.
  • Manages everything from simple, linear, real-time events to complex, multi-source event workflows
  • CloudEvents specification compliant
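Connecting an event source to a workflow is done with a Sensor: it declares event dependencies and the triggers to fire (a sketch; the `github` EventSource name and the trivial one-step Workflow are hypothetical placeholders):

```yaml
# Sketch: on a GitHub push event, submit an Argo Workflow.
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: webhook-sensor
spec:
  dependencies:
    - name: push
      eventSourceName: github   # hypothetical EventSource name
      eventName: push
  triggers:
    - template:
        name: run-pipeline
        argoWorkflow:
          operation: submit
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: ml-pipeline-
              spec:
                entrypoint: main
                templates:
                  - name: main
                    container:
                      image: busybox
                      command: [echo]
                      args: ["pipeline triggered"]
```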

22 of 25

Cloud Native Production-ready AI Platform

5. Workflow

(Diagram) Event-driven ML pipeline example: GitHub events (commits/PRs/tags/etc.) are received by Argo Events, which consults a cache store (Argo/K8s/etc.) to check whether the data has already been updated recently. If it has NOT, Argo Events triggers an ML pipeline (data ingestion, then model training) with Argo Workflows.

23 of 25

Cloud Native Production-ready AI Platform

6. Iterations

24 of 25

Distributed Machine Learning Patterns

25 of 25

Questions?

Stay in touch!

  • LinkedIn/X/GitHub: @TerryTangYuan
  • Mastodon: https://fosstodon.org/@terrytangyuan