Yuan Tang @TerryTangYuan

Principal Software Engineer, Red Hat OpenShift AI
Project Lead, Argo & Kubeflow

Production-Ready AI Platform on Kubernetes

Agenda

  • AI Landscape & Ecosystem
  • Elements of Production Readiness
    • Scalability
    • Reliability
    • Observability
    • Flexibility
  • Cloud Native Production-ready AI Platform
    • Data Processing
    • Model Training
    • Model Tuning
    • Model Serving
    • Workflow

Distributed Machine Learning Patterns

AI Landscape & Ecosystem

[Landscape diagrams by the CNCF Cloud Native AI Working Group]

[Figure: the AI landscape as a spectrum, from how much infrastructure is needed to how much the data scientist cares about it]

Production Readiness - Scalability

  • Horizontal scaling - more pods
    • Kubernetes Horizontal Pod Autoscaler (see the sketch after this list)
    • Knative pod autoscaler: request-driven, including scale-to-zero
  • Vertical scaling - more resources for existing pods
    • Kubernetes Vertical Pod Autoscaler
    • Resizer (addon-resizer): adjusts resources based on the number of nodes in the cluster
  • Cluster autoscaler - automatically adjusts the size of a Kubernetes cluster
  • Algorithm scalability
  • Hardware acceleration and resource sharing
  • Batch scheduling
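
A minimal sketch of horizontal scaling with the Kubernetes Horizontal Pod Autoscaler; the target Deployment and thresholds are hypothetical:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server            # hypothetical Deployment serving a model
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80  # add pods when average CPU utilization exceeds 80%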

Production Readiness - Reliability

  • High availability and disaster recovery
    • K8s controllers: leader election
  • Elasticity and fault tolerance
  • Versioning: GitOps
  • Avoiding vendor lock-in / hybrid cloud
  • Support/SLAs

Production Readiness - Observability

  • Performance metrics
    • Statistical (hyperparameter tuning, experiment tracking)
    • Operational (system, resources)
  • Explainability & visualization
  • Pipeline tracing
  • Audit logs

Production Readiness - Flexibility

  • Various ML frameworks
  • Language-specific SDKs
  • Standardized APIs
  • Data: size, streaming/batching
  • Model: size, framework, performance
  • Integration with various hardware accelerators
  • Cloud/on-prem/edge
  • Avoiding vendor lock-in

Kubeflow: The ML Toolkit for Kubernetes

Cloud Native Production-ready AI Platform

  1. Data Processing
  • Big data - Apache Spark
    • Batch & streaming
    • Time series
  • Welcome kubeflow/spark-operator to the Kubeflow project! (a minimal SparkApplication sketch follows)
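
A minimal sketch of a SparkApplication handled by the Spark operator; the image, main file, and sizing are hypothetical:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: etl-job                     # hypothetical name
spec:
  type: Python
  mode: cluster
  image: spark:3.5.0                # hypothetical image
  mainApplicationFile: local:///opt/app/etl.py   # hypothetical script
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 1g
    serviceAccount: spark
  executor:
    instances: 2                    # two executors for the batch job
    cores: 2
    memory: 2g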

  1. Data Processing (continued)
  • Fluid (fluid-cloudnative/fluid) - see the sketch after this list
    • Enables dataset warmup and acceleration for data-intensive applications using distributed caching on Kubernetes
    • Dataset abstractions for managing heterogeneous data sources
    • Data-aware scheduling
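
A minimal sketch of a Fluid Dataset backed by an Alluxio cache runtime; the bucket path and cache quota are hypothetical:

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: training-data
spec:
  mounts:
    - mountPoint: s3://example-bucket/training-data   # hypothetical source
      name: training-data
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: training-data               # must match the Dataset name
spec:
  replicas: 2                       # number of cache workers
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi                  # per-worker cache size
        high: "0.95"
        low: "0.7"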

2. Model Training

[Figure: distributed all-reduce model training with multiple workers and data partitions - each worker consumes its own data partition. Source: the Distributed Machine Learning Patterns book]

Kubeflow Training Operator Architecture

Distributed training with TensorFlow
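
A minimal TFJob sketch for the Kubeflow Training Operator; the image and training script are hypothetical:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: distributed-training       # hypothetical name
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 3                  # three workers, mirroring the all-reduce figure
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow     # the Training Operator expects this container name
              image: example.com/tf-train:latest   # hypothetical image
              command: ["python", "/app/train.py"]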

Distributed large model fine-tuning

For details, see the talk on this topic by Andrey Velichkevich; a minimal PyTorchJob sketch follows.
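
A hedged sketch of multi-node fine-tuning as a Kubeflow PyTorchJob; the model, image, and resources are hypothetical:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-fine-tune              # hypothetical name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch        # the Training Operator expects this container name
              image: example.com/fine-tune:latest   # hypothetical image
              command: ["torchrun", "--nnodes=4", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 3                  # three workers plus the master: four nodes total
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: example.com/fine-tune:latest   # hypothetical image
              command: ["torchrun", "--nnodes=4", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1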

3. Model Tuning

Katib: Kubernetes-native AutoML in Kubeflow

  • Supports hyperparameter (HP) tuning, neural architecture search (NAS), and early stopping
  • Agnostic to ML frameworks and programming languages
  • Can be deployed on local machines or on private/public clouds
  • Can orchestrate any Kubernetes workloads and custom resources
  • Natively integrated with Kubeflow components (Notebooks, Pipelines, Training Operator)

Katib Architecture

Reference paper: https://arxiv.org/abs/2006.02085

Example

A Katib Experiment spec combines an experiment budget, a search space, an algorithm, an objective, and a trial template.
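
A hedged reconstruction of such an Experiment, following the canonical random-search layout; the image and hyperparameter range are hypothetical:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search-example      # hypothetical name
spec:
  # Experiment budget
  maxTrialCount: 12
  parallelTrialCount: 3
  maxFailedTrialCount: 3
  # Objective
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  # Algorithm
  algorithm:
    algorithmName: random
  # Search space
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
  # Trial template
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate for the training job
        reference: lr
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: training-container
                image: example.com/train:latest   # hypothetical image
                command: ["python", "train.py", "--lr=${trialParameters.learningRate}"]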

4. Model Serving

KServe: a highly scalable, standards-based, cloud-agnostic model inference platform on Kubernetes

  • Performant, standardized inference protocol across ML frameworks.
  • Serverless inference workloads with request-based autoscaling, including scale-to-zero on CPU and GPU.
  • High scalability, density packing, and intelligent routing using ModelMesh.
  • Simple and pluggable production serving for inference, pre/post-processing, monitoring, and explainability.
  • Advanced deployments: canary rollouts, pipelines, and ensembles with InferenceGraph.

Single model serving
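
A minimal single-model InferenceService sketch; the storage URI follows the sklearn example from the KServe docs, and the name is hypothetical:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris               # hypothetical name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model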

Multi-model serving: ModelMesh

  • Designed for high-scale, high-density, and frequently-changing model use cases.
  • Intelligently loads and unloads models to and from memory to strike a trade-off between responsiveness to users and computational footprint. (see the sketch after this list)
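
Assuming ModelMesh Serving is installed, an InferenceService can opt into mesh scheduling via a deployment-mode annotation; a sketch, with a hypothetical model location:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mnist-svm                  # hypothetical name
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://example-models/sklearn/mnist-svm.joblib   # hypothetical location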

LLMs

> curl -H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
    -v http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/${MODEL_NAME}/infer \
    -d '{"id": "42", "inputs": [{"name": "input0", "shape": [-1], "datatype": "BYTES", "data": ["Where is Eiffel Tower?"]}]}'

{"text_output":"The Eiffel Tower is located in the 7th arrondissement of Paris, France. It stands on the Champ de Mars, a large public park next to the Seine River. The tower's exact address is:\n\n2 Rue du Champ de Mars, 75007 Paris, France.","model_name":"llama2"}

Problem: model initialization takes a long time

Solution: the Modelcars feature in KServe (the model is packaged as an OCI image) brings:

  • Reduced startup times: repetitive downloads of large models are avoided, significantly minimizing startup delays.
  • Lower disk space usage: less duplicated local storage is needed, conserving disk space.
  • Enhanced performance: allows advanced techniques like image pre-fetching and lazy loading, improving efficiency.

A hedged InferenceService sketch follows.
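
Assuming Modelcars is enabled in the KServe configuration, a model baked into an OCI image can be referenced with an oci:// storage URI; the registry and model format here are hypothetical:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama2-modelcar            # hypothetical name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface          # hypothetical format/runtime
      storageUri: oci://registry.example.com/models/llama2:latest   # model packaged as an OCI image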

5. Workflow

Argo Workflows

The container-native workflow engine for Kubernetes

  • Machine learning pipelines
  • Data processing/ETL
  • Infrastructure automation
  • Continuous delivery/integration

CRDs and Controllers

  • Kubernetes custom resources that natively integrate with other K8s resources (volumes, secrets, etc.)

Interfaces

  • CLI: manage workflows and perform operations (submit, suspend, delete, etc.)
  • Server: REST & gRPC interfaces
  • SDKs: Python, Go, and Java
  • UI: manage and visualize workflows, artifacts, logs, resource usage analytics, etc.

Example
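
A hedged sketch of an ML pipeline expressed as an Argo Workflow DAG; the images and commands are hypothetical:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-       # hypothetical name prefix
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: ingest
            template: ingest
          - name: train
            template: train
            dependencies: [ingest] # train runs only after ingest completes
    - name: ingest
      container:
        image: example.com/etl:latest     # hypothetical image
        command: [python, ingest.py]
    - name: train
      container:
        image: example.com/train:latest   # hypothetical image
        command: [python, train.py]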

Argo Events

Event-driven workflow automation

  • Supports events from 20+ event sources
    • Webhooks, S3, GCP PubSub, Git, Slack, etc.
  • Supports 10+ triggers
    • Kubernetes Objects, Argo Workflow, AWS Lambda, Kafka, Slack, etc.
  • Manages everything from simple, linear, real-time events to complex, multi-source events
  • CloudEvents specification compliant

[Figure: Argo Events receives GitHub events (commits/PRs/tags/etc.) and triggers an ML pipeline with Argo Workflows. A cache store (Argo/K8s/etc.) records whether the data has already been updated recently: if it has NOT, the pipeline runs data ingestion before model training; if it has, ingestion is skipped.]
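
A hedged sketch of the trigger wiring, using a generic webhook EventSource in place of the GitHub one; the names and the referenced WorkflowTemplate are hypothetical:

apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: repo-webhook               # hypothetical name
spec:
  webhook:
    push:                          # hypothetical event name
      port: "12000"
      endpoint: /push
      method: POST
---
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: retrain-on-push            # hypothetical name
spec:
  dependencies:
    - name: push-dep
      eventSourceName: repo-webhook
      eventName: push
  triggers:
    - template:
        name: run-ml-pipeline
        argoWorkflow:
          operation: submit        # submit a new Workflow when the event fires
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: ml-pipeline-
              spec:
                workflowTemplateRef:
                  name: ml-pipeline    # hypothetical WorkflowTemplate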

6. Iterations

Distributed Machine Learning Patterns