Yuan Tang @TerryTangYuan
Principal Software Engineer, Red Hat OpenShift AI
Project Lead, Argo & Kubeflow
Engineering Cloud Native AI Platform
AI Landscape & Ecosystem
Production Readiness - Scalability
Production Readiness - Reliability
Production Readiness - Observability
Production Readiness - Flexibility
Kubeflow: The ML Toolkit for Kubernetes
Cloud Native Production-ready AI Platform
2. Model Training
Kubeflow Training Operator Architecture
Distributed training with TensorFlow
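To make the Training Operator's role concrete, here is a minimal sketch of a TFJob manifest built as a plain Python dict. The job name, container image, and replica counts are illustrative assumptions, not values from the talk; in practice the manifest would be applied to the cluster, where the Training Operator creates the pods and injects `TF_CONFIG` so `tf.distribute` strategies can discover the cluster.

```python
# Hypothetical sketch of a Kubeflow TFJob manifest as a Python dict.
# Image name and replica counts are illustrative placeholders.
def make_tfjob(name, worker_replicas=2, image="example/mnist-train:latest"):
    pod_template = {
        "spec": {
            "containers": [{
                "name": "tensorflow",  # TFJob expects this container name
                "image": image,
            }]
        }
    }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                # One chief coordinates; workers run the replicated training.
                "Chief": {"replicas": 1, "template": pod_template},
                "Worker": {"replicas": worker_replicas, "template": pod_template},
            }
        },
    }

job = make_tfjob("mnist-distributed")
```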
Cloud Native Production-ready AI Platform
3. Model Tuning
Katib: Kubernetes-native AutoML in Kubeflow
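As an illustration of how Katib expresses a tuning job, the sketch below models an Experiment as a Python dict: an objective metric, a search algorithm, and a hyperparameter search space. The experiment name, metric name, parameter range, and trial counts are assumptions for the example; the trial template, which defines the training job each trial runs, is omitted.

```python
# Illustrative Katib Experiment: random search over the learning rate.
# All concrete values here are placeholders, not from the talk.
experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"name": "random-search-lr"},
    "spec": {
        # Katib maximizes/minimizes this metric reported by each trial.
        "objective": {"type": "maximize", "objectiveMetricName": "accuracy"},
        "algorithm": {"algorithmName": "random"},
        "maxTrialCount": 12,
        "parallelTrialCount": 3,
        "parameters": [{
            "name": "lr",
            "parameterType": "double",
            "feasibleSpace": {"min": "0.001", "max": "0.1"},
        }],
        # "trialTemplate" (omitted) would define the per-trial training job.
    },
}
```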
4. Model Serving
KServe: A highly scalable, standard, cloud-agnostic model inference platform on Kubernetes
Single model serving
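In the single-model case, one InferenceService serves one model. A minimal sketch, again as a Python dict; the service name and `storageUri` are placeholders, and the predictor runtime (sklearn here) is just one example of the runtimes KServe supports.

```python
# Minimal KServe InferenceService for a single model.
# Name and storageUri are illustrative placeholders.
isvc = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris"},
    "spec": {
        "predictor": {
            # KServe pulls the model from this URI into the serving runtime.
            "sklearn": {"storageUri": "gs://example-bucket/models/iris"},
        }
    },
}
```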
Multi model serving: ModelMesh
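ModelMesh packs many models onto a shared pool of serving-runtime pods instead of one deployment per model, which matters when serving hundreds of small models. A hedged sketch of opting an InferenceService into ModelMesh via its deployment-mode annotation (the annotation key follows KServe's documented convention; the name and model location are placeholders):

```python
# Sketch: routing an InferenceService through ModelMesh via annotation.
# Name and storageUri are placeholders.
mm_isvc = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {
        "name": "sklearn-iris-mm",
        # Asks KServe to place this model on the shared ModelMesh pods
        # rather than creating a dedicated deployment.
        "annotations": {"serving.kserve.io/deploymentMode": "ModelMesh"},
    },
    "spec": {
        "predictor": {
            "sklearn": {"storageUri": "gs://example-bucket/models/iris"},
        }
    },
}
```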
Inference Graph
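An InferenceGraph chains multiple InferenceServices into one routable graph, e.g. a preprocessor feeding a classifier. An illustrative sketch under the assumption of two hypothetical services named `preprocessor` and `classifier`; the exact step fields follow KServe's documented Sequence router shape:

```python
# Illustrative KServe InferenceGraph: two services chained in sequence.
# Service names are hypothetical.
graph = {
    "apiVersion": "serving.kserve.io/v1alpha1",
    "kind": "InferenceGraph",
    "metadata": {"name": "preprocess-then-classify"},
    "spec": {
        "nodes": {
            "root": {
                # A Sequence router calls each step in order, passing the
                # previous step's response downstream.
                "routerType": "Sequence",
                "steps": [
                    {"serviceName": "preprocessor"},
                    {"serviceName": "classifier", "data": "$response"},
                ],
            }
        }
    },
}
```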
Canary Rollout
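KServe supports canary rollouts by splitting traffic between the previously rolled-out model revision and the latest one, so a new model can be validated on a small slice of live traffic before promotion. A sketch of the predictor spec, assuming a hypothetical v2 model location and a 10% canary split:

```python
# Canary rollout sketch: canaryTrafficPercent sends a fraction of traffic
# to the latest revision; the rest stays on the previous revision.
canary_spec = {
    "predictor": {
        "canaryTrafficPercent": 10,  # 10% to the new model revision
        "sklearn": {"storageUri": "gs://example-bucket/models/iris-v2"},
    }
}
```

Raising the percentage step by step (and finally removing the field) completes the rollout; setting it back to 0 rolls the canary back.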
Explainability
“Why did my model produce this inference result?”
Pluggable explainer runtimes
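With a pluggable explainer runtime, KServe runs an explainer alongside the predictor so clients can ask for an explanation of an inference result, not just the result itself. A hedged sketch assuming the Alibi AnchorTabular runtime; the service name and model location are placeholders:

```python
# Sketch: InferenceService with an attached explainer runtime.
# Name and storageUri are placeholders.
explain_isvc = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "income-classifier"},
    "spec": {
        "predictor": {
            "sklearn": {"storageUri": "gs://example-bucket/models/income"},
        },
        # The explainer serves ":explain" requests alongside ":predict",
        # answering "why did my model produce this inference result?"
        "explainer": {"alibi": {"type": "AnchorTabular"}},
    },
}
```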
5. Workflow
Argo Workflows
The container-native workflow engine for Kubernetes
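In Argo Workflows, each step runs as a container and steps can be wired into a DAG. A minimal sketch of an ML pipeline as a two-task DAG (ingest data, then train); the container images and template names are illustrative placeholders:

```python
# Minimal Argo Workflow sketch: a two-step DAG. Images are placeholders.
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "ml-pipeline-"},
    "spec": {
        "entrypoint": "pipeline",
        "templates": [
            {
                "name": "pipeline",
                "dag": {"tasks": [
                    {"name": "ingest", "template": "ingest"},
                    # "train" waits for "ingest" to succeed before starting.
                    {"name": "train", "template": "train",
                     "dependencies": ["ingest"]},
                ]},
            },
            {"name": "ingest", "container": {"image": "example/ingest:latest"}},
            {"name": "train", "container": {"image": "example/train:latest"}},
        ],
    },
}
```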
Argo Events
Event-driven workflow automation
Diagram: GitHub events (commits/PRs/tags/etc.) arrive at Argo Events, which consults a cache store (Argo/K8s/etc.). If the data has NOT been updated recently, Argo Events triggers an ML pipeline with Argo Workflows (data ingestion followed by model training); if the data has already been updated recently, the run is skipped.
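The GitHub-to-pipeline hook above can be sketched as an Argo Events Sensor: it depends on an event from a GitHub event source and, when one fires, creates the pipeline Workflow. The event-source name, event name, and workflow naming are hypothetical, and the cache check and workflow spec are omitted:

```python
# Sketch of an Argo Events Sensor: on a GitHub push event, submit the
# ML pipeline as an Argo Workflow. Names are hypothetical placeholders.
sensor = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Sensor",
    "metadata": {"name": "github-to-ml-pipeline"},
    "spec": {
        # The event this sensor waits for, from a GitHub EventSource.
        "dependencies": [{
            "name": "github-push",
            "eventSourceName": "github",
            "eventName": "push",
        }],
        # On a matching event, create a Workflow resource in the cluster.
        "triggers": [{
            "template": {
                "name": "run-ml-pipeline",
                "k8s": {
                    "operation": "create",
                    "source": {"resource": {
                        "apiVersion": "argoproj.io/v1alpha1",
                        "kind": "Workflow",
                        "metadata": {"generateName": "ml-pipeline-"},
                        # spec omitted: the ingestion + training DAG
                    }},
                },
            }
        }],
    },
}
```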
6. Iterations
Distributed Machine Learning Patterns
Questions?