1 of 25

Yuan Tang @TerryTangYuan

Principal Software Engineer, Red Hat OpenShift AI · Project Lead, Argo & Kubeflow

Engineering Cloud Native AI Platform

2 of 25

AI Landscape & Ecosystem

3 of 25

AI Landscape & Ecosystem

4 of 25

Production Readiness - Scalability

  • Horizontal scaling - more pods
    • K8s horizontal pod autoscaler
    • Knative pod autoscaler: event-driven
  • Vertical scaling - more resources for existing pods
    • K8s vertical pod autoscaler
    • Addon Resizer: adjusts a pod's resources based on the number of nodes in the cluster
  • Cluster autoscaler - automatically adjusts the size of a Kubernetes cluster
  • Algorithm scalability
  • Hardware acceleration, resource sharing, scheduling
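As a concrete sketch of the first bullet, a Kubernetes HorizontalPodAutoscaler can scale an inference Deployment on CPU utilization (the `model-server` Deployment name is a hypothetical placeholder):

```yaml
# Minimal sketch: keep between 2 and 10 replicas of a hypothetical
# "model-server" Deployment, targeting 80% average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```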

5 of 25

Production Readiness - Reliability

  • High availability and disaster recovery
    • K8s controller: leader election
  • Elasticity and fault tolerance
  • Versioning: GitOps
  • Vendor lock-in/hybrid cloud
  • Support/SLAs
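The GitOps bullet can be made concrete with, for example, an Argo CD Application that keeps the cluster in sync with a versioned Git repository (a sketch; the repo URL, path, and namespaces are placeholders):

```yaml
# Sketch: declaratively sync cluster state from Git (GitOps).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-platform
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/ml-platform-config  # placeholder repo
    targetRevision: main
    path: manifests/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-platform
  syncPolicy:
    automated:
      prune: true     # remove resources deleted from Git
      selfHeal: true  # revert manual drift back to the Git state
```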

6 of 25

Production Readiness - Observability

  • Performance metrics
    • Statistical (HP tuning, experiment tracking)
    • Operational (system, resources)
  • Explainability & visualization
  • Pipeline tracing & audit log

7 of 25

Production Readiness - Flexibility

  • Various ML frameworks
  • Language-specific SDKs
  • Standardized APIs
  • Data: size, streaming/batching
  • Model: size, framework, performance
  • Integration with various hardware accelerators
  • Cloud/on-prem/edge
  • Vendor lock-in

8 of 25

Kubeflow: The ML Toolkit for Kubernetes

9 of 25

Cloud Native Production-ready AI Platform

  1. Data Processing
  • Big data - Apache Spark
    • Batch & Streaming
    • Time series
  • Welcome kubeflow/spark-operator to Kubeflow project!
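With the Spark operator, a batch job is declared as a SparkApplication custom resource (a minimal sketch; the image and application jar path are placeholders):

```yaml
# Sketch: run the SparkPi example as a cluster-mode SparkApplication.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  type: Scala
  mode: cluster
  image: my-registry/spark:3.5.0            # placeholder image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/spark-examples.jar  # placeholder path
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 512m
  executor:
    instances: 2
    cores: 1
    memory: 512m
```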

10 of 25

Cloud Native Production-ready AI Platform

2. Model Training

Kubeflow Training Operator Architecture

11 of 25

Cloud Native Production-ready AI Platform

2. Model Training

Distributed training with TensorFlow
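With the training operator, distributed TensorFlow is expressed as a TFJob whose replica specs map to the TF cluster roles (a sketch; the training image is a hypothetical placeholder, and a `PS` replica spec could be added the same way for parameter-server training):

```yaml
# Sketch: 1 chief + 2 workers; the operator wires up TF_CONFIG for each pod.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: dist-mnist
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow   # TFJob expects this container name
              image: my-registry/mnist-train:latest  # placeholder image
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: my-registry/mnist-train:latest  # placeholder image
```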

12 of 25

Cloud Native Production-ready AI Platform

3. Model Tuning

Katib: Kubernetes-native AutoML in Kubeflow

  • Supports hyperparameter (HP) tuning, neural architecture search (NAS), and early stopping
  • Agnostic to ML frameworks and programming languages
  • Can be deployed on local machines or on private/public clouds
  • Can orchestrate any Kubernetes workloads and custom resources
  • Natively integrated with Kubeflow components (Notebooks, Pipelines, Training Operators)
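A Katib tuning run is declared as an Experiment: an objective, a search algorithm, a parameter space, and a trial template that the controller instantiates per trial (a sketch; the training image and script are hypothetical placeholders):

```yaml
# Sketch: random search over learning rate, maximizing reported accuracy.
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search-example
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  maxTrialCount: 12
  parallelTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
  trialTemplate:
    primaryContainerName: training
    trialParameters:
      - name: learningRate
        reference: lr
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training
                image: my-registry/train:latest  # placeholder image
                command: ["python", "train.py", "--lr=${trialParameters.learningRate}"]
            restartPolicy: Never
```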

13 of 25

Cloud Native Production-ready AI Platform

3. Model Tuning

14 of 25

Cloud Native Production-ready AI Platform

4. Model Serving

KServe: Highly scalable, standard, cloud-agnostic model inference platform on Kubernetes

  • Performant, standardized inference protocol across ML frameworks.
  • Serverless inference workloads with request-based autoscaling, including scale-to-zero on CPU and GPU.
  • High scalability, density packing and intelligent routing using ModelMesh.
  • Simple and pluggable production serving for inference, pre/post processing, monitoring and explainability.
  • Advanced deployments for canary rollout, pipeline, ensembles with InferenceGraph.
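A minimal KServe deployment is a single InferenceService (a sketch; the storage URI points at KServe's public sklearn example model):

```yaml
# Sketch: serve a scikit-learn model; minReplicas: 0 enables scale-to-zero.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 0
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```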

15 of 25

Cloud Native Production-ready AI Platform

4. Model Serving

Single model serving

16 of 25

Cloud Native Production-ready AI Platform

4. Model Serving

Multi model serving: ModelMesh

  • Designed for high-scale, high-density and frequently-changing model use cases.
  • Intelligently loads and unloads models to and from memory to strike a trade-off between responsiveness to users and computational footprint.
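Routing an InferenceService through ModelMesh instead of the default serverless path is done with a deployment-mode annotation (a sketch; model name and storage URI are placeholders):

```yaml
# Sketch: same InferenceService API, but served by ModelMesh for density packing.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mnist-mm
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://my-bucket/models/mnist  # placeholder URI
```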

17 of 25

Cloud Native Production-ready AI Platform

4. Model Serving

Inference Graph
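An InferenceGraph chains multiple InferenceServices into one endpoint (a sketch; the two service names are hypothetical, and the `Sequence` router pipes each step's output to the next):

```yaml
# Sketch: preprocessing followed by classification as a sequential graph.
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: example-graph
spec:
  nodes:
    root:
      routerType: Sequence
      steps:
        - serviceName: preprocess-svc   # hypothetical InferenceService
        - serviceName: classifier-svc   # hypothetical InferenceService
          data: $response               # feed previous step's response forward
```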

18 of 25

Cloud Native Production-ready AI Platform

4. Model Serving

Canary Rollout
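A canary rollout is expressed directly on the predictor: updating the spec with `canaryTrafficPercent` splits traffic between the previously promoted revision and the new one (a sketch; the v2 storage URI is a placeholder):

```yaml
# Sketch: send 10% of traffic to the new model revision, 90% to the last good one.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://my-bucket/models/iris/v2  # placeholder: new model version
```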

19 of 25

Cloud Native Production-ready AI Platform

4. Model Serving

Pluggable explainer runtimes

“Why did my model produce this inference result?”

Explainability
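As one sketch of a pluggable explainer runtime, KServe's v1beta1 API allows attaching an Alibi explainer alongside the predictor, so the same endpoint can answer "why" queries (model and explainer storage URIs are placeholders):

```yaml
# Sketch: predictor plus an Alibi AnchorTabular explainer on one InferenceService.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: income-classifier
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://my-bucket/models/income      # placeholder URI
  explainer:
    alibi:
      type: AnchorTabular
      storageUri: gs://my-bucket/explainers/income  # placeholder URI
```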

20 of 25

Cloud Native Production-ready AI Platform

5. Workflow

Argo Workflows

The container-native workflow engine for Kubernetes

  • Machine learning pipelines
  • Data processing/ETL
  • Infrastructure automation
  • Continuous delivery/integration
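An Argo Workflow is itself a Kubernetes resource: each step runs as a container (a minimal sketch in the spirit of the classic hello-world example):

```yaml
# Sketch: a single-step Workflow whose entrypoint runs one container.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: hello
  templates:
    - name: hello
      container:
        image: busybox
        command: [echo]
        args: ["hello world"]
```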

21 of 25

Cloud Native Production-ready AI Platform

5. Workflow

Argo Events

Event-driven workflow automation

  • Supports events from 20+ event sources
    • Webhooks, S3, GCP PubSub, Git, Slack, etc.
  • Supports 10+ triggers
    • Kubernetes Objects, Argo Workflow, AWS Lambda, Kafka, Slack, etc.
  • Manages everything from simple, linear, real-time events to complex, multi-source event workflows
  • CloudEvents specification compliant
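Connecting an event source to a workflow is done with a Sensor: it declares event dependencies and the triggers to fire (a sketch; the `github` EventSource name and the trivial one-step Workflow are hypothetical placeholders):

```yaml
# Sketch: on a GitHub push event, submit an Argo Workflow.
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: webhook-sensor
spec:
  dependencies:
    - name: push
      eventSourceName: github   # hypothetical EventSource name
      eventName: push
  triggers:
    - template:
        name: run-pipeline
        argoWorkflow:
          operation: submit
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: ml-pipeline-
              spec:
                entrypoint: main
                templates:
                  - name: main
                    container:
                      image: busybox
                      command: [echo]
                      args: ["pipeline triggered"]
```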

22 of 25

Cloud Native Production-ready AI Platform

5. Workflow

(Diagram) Event-driven ML pipeline example: GitHub events (commits/PRs/tags/etc.) are received by Argo Events, which consults a cache store (Argo/K8s/etc.) to check whether the data has already been updated recently. If it has NOT, Argo Events triggers an ML pipeline (data ingestion, then model training) with Argo Workflows.

23 of 25

Cloud Native Production-ready AI Platform

6. Iterations

24 of 25

Distributed Machine Learning Patterns

25 of 25

Questions?

Stay in touch!

  • LinkedIn/X/GitHub: @TerryTangYuan
  • Mastodon: https://fosstodon.org/@terrytangyuan