Kubeflow: The Machine Learning Toolkit for Kubernetes
Krishna Durai, Cisco Systems, Bangalore
Impact of Machine Learning
Kubeflow makes it easy for everyone to develop, deploy and manage a portable, distributed and scalable ML system on Kubernetes
Agenda
ML Lifecycle: Perception vs. Reality
Source: KubeCon 2018 (Budapest) Talk: Building ML Products with Kubeflow
Building ML Products
Building a Model
Logging
Data Ingestion
Data Analysis
Data Transformation
Data Validation
Data Splitting
Trainer
Model Validation
Training At Scale
Roll-out
Serving
Monitoring
Source: Kubeflow Contributor Summit 2019 talk: Kubeflow and ML Landscape
Kubeflow runs on Kubernetes
General Applications
Infrastructure (Cloud/On-Prem)
Container Runtimes
ML Workloads (Modelling, training, roll-out, serving, ...)
Infrastructure (Cloud/On-Prem)
Container Runtimes
Kubeflow Current Features
Building a Model
Logging
Data Ingestion
Data Analysis
Data Transformation
Data Validation
Data Splitting
Trainer
Model Validation
Training At Scale
Roll-out
Serving
Monitoring
Source: Kubeflow Contributor Summit 2019 talk: Kubeflow and ML Landscape
Kubeflow Architecture
Make it easy to deploy and administer a platform
Tie it together using
Libraries and CLIs
Higher Level Services
Low Level APIs / Services
Arena
kfctl
kubectl
Katib
Pipelines
Notebooks
Fairing
TFJob
PyTorchJob
Jupyter CR
Seldon CR
Kubebench
Metadata
Orchestration
Pipelines CR
Argo
Study Job
MPIJob
Spark Job
Model DB
TFX
Developed By Kubeflow
Developed Outside Kubeflow
Adapted from Kubeflow Contributor Summit 2019 talk: Kubeflow and ML Landscape (Not all components are shown)
Job Operators (TF/PyTorch/MPI/...)
Provides a set of K8s Custom Resources that turn the distributed-training concepts of ML frameworks into K8s resources
Makes it easy to configure and run local/distributed training jobs of various ML frameworks on K8s
TFJob Operator
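As a minimal sketch of what "turning distributed concepts into K8s resources" can look like, the snippet below builds a TFJob manifest as a plain Python dict. It assumes the `kubeflow.org/v1` schema with `tfReplicaSpecs` and a training container named `tensorflow`; the job name, image and replica counts are hypothetical, and the API version installed on a given cluster may differ.

```python
import json


def build_tfjob(name, image, workers=2, ps=1):
    """Return a TFJob manifest as a plain dict (assumed kubeflow.org/v1 schema)."""

    def replica(count):
        # Each replica type (parameter server, worker) gets its own pod template.
        # The training container is assumed to be named "tensorflow".
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {"containers": [{"name": "tensorflow", "image": image}]}
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {"tfReplicaSpecs": {"PS": replica(ps), "Worker": replica(workers)}},
    }


# Hypothetical job name and image, for illustration only.
manifest = build_tfjob("mnist-train", "example.com/mnist:latest", workers=3)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be piped to `kubectl apply -f -` on a cluster where the TFJob operator is installed.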
Notebooks (and beyond)
Orchestration for Notebooks
Integration with advanced DSLs
Pipelines
Combine individual tasks into end-to-end workflows
Provide orchestration and service integration
Enable components & sharing
Help with job tracking, experimentation, monitoring
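The idea of combining individual tasks into an end-to-end workflow can be illustrated with a toy dependency-ordered runner. This is not the Kubeflow Pipelines SDK; it is a standard-library sketch, and the step names and step bodies are hypothetical stand-ins for containerized tasks.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(steps, deps):
    """Run steps in dependency order, passing each step its upstream outputs."""
    outputs = {}
    # deps maps a step name to the set of steps it depends on.
    for name in TopologicalSorter(deps).static_order():
        upstream = {d: outputs[d] for d in deps.get(name, ())}
        outputs[name] = steps[name](upstream)
    return outputs


# Hypothetical steps: each takes a dict of upstream outputs and returns a value.
steps = {
    "ingest": lambda up: [3, 1, 2],
    "transform": lambda up: sorted(up["ingest"]),
    "train": lambda up: sum(up["transform"]) / len(up["transform"]),
    "serve": lambda up: f"model(mean={up['train']})",
}
deps = {"transform": {"ingest"}, "train": {"transform"}, "serve": {"train"}}

result = run_pipeline(steps, deps)
print(result["serve"])
```

In a real pipeline each step would run as its own container, and the orchestrator would add the tracking and monitoring listed above; the sketch only shows the ordering and hand-off of outputs.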
Katib - AutoML and Hyperparameter Tuning
Provides a service that automates optimization of model hyperparameters and neural network architectures
Makes it easy to track, view and manage experiment results for hyperparameter candidates
Supports a growing set of suggestion algorithms
Zhou, Jinan, et al. "Katib: A Distributed General AutoML Platform on Kubernetes." USENIX OpML 19, 2019.
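To make the role of a suggestion algorithm concrete, here is a minimal random-search sketch in plain Python. It does not use Katib's API; the objective function, parameter names and ranges are hypothetical, and random search stands in for whichever algorithm an experiment would actually configure.

```python
import random


def random_search(objective, space, trials=20, seed=0):
    """Sample `trials` candidates uniformly from `space` and return the best one."""
    rng = random.Random(seed)
    history = []
    for _ in range(trials):
        # Suggest one candidate: a value for every hyperparameter in the space.
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        history.append((objective(params), params))
    # Lower objective is better; compare on the score only.
    best_score, best_params = min(history, key=lambda t: t[0])
    return best_params, best_score, history


def toy_loss(p):
    # Hypothetical stand-in for a validation loss reported by a training job.
    return (p["learning_rate"] - 0.1) ** 2 + (p["dropout"] - 0.5) ** 2


space = {"learning_rate": (0.001, 0.5), "dropout": (0.0, 0.9)}
best_params, best_score, history = random_search(toy_loss, space)
print(best_params, best_score)
```

The `history` list plays the part of the tracked experiment results: every candidate and its score is kept so trials can be compared afterwards.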
Kubebench
A harness for benchmarking ML workloads on Kubernetes
Automates common benchmarking tasks as pipelines (configuration, deployment, result collection, ...)
Evaluate performance of ML models as well as the system stack
Enable deployment of state-of-the-art benchmarking workloads on K8s
TensorFlow MNIST Handwritten Digits Detection with Hyperparameter Tuning
Demo
Use Case: Conversational Chatbot
Goal: Train, evaluate, and test an AI-powered chatbot for suggesting appointments
The Rasa Stack is a set of open source machine learning tools for developers to create contextual AI assistants and chatbots.
Use Case: Initial Solution
Rasa Model Training Pipeline
TensorFlow Serving
Trains Rasa NLU and Rasa Core models
TF trained models’ metadata
Apache Beam Data Processing
Rasa JSON input format
Training Data
Prepares Training Data for Rasa
Deployed on AWS EC2 machine
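The data-preparation step in the initial solution can be sketched as a conversion from labelled utterances into Rasa's JSON input. The snippet assumes the legacy Rasa NLU JSON layout (`rasa_nlu_data` / `common_examples`), uses plain Python rather than Apache Beam, and the example utterances, intents and entity names are hypothetical.

```python
import json


def to_rasa_nlu(rows):
    """Convert (text, intent, [(value, entity), ...]) rows to Rasa NLU JSON."""
    examples = []
    for text, intent, entity_spans in rows:
        entities = []
        for value, entity in entity_spans:
            # Locate the first occurrence of the entity value in the utterance.
            start = text.index(value)
            entities.append(
                {"start": start, "end": start + len(value),
                 "value": value, "entity": entity}
            )
        examples.append({"text": text, "intent": intent, "entities": entities})
    return {"rasa_nlu_data": {"common_examples": examples}}


# Hypothetical training rows for an appointment-suggestion chatbot.
rows = [
    ("book an appointment for friday", "suggest_appointment", [("friday", "day")]),
    ("hello", "greet", []),
]
data = to_rasa_nlu(rows)
print(json.dumps(data, indent=2))
```

In a Beam job the same conversion would typically run as a per-record transform, with the combined examples written out as a single JSON file for the training pipeline.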
Use Case: Kubeflow Solution
deploy-model
Comparing Kubeflow with Similar Frameworks
Features | Kubeflow | MLflow | H2O.ai |
Model Training, HP Tuning, Serving Deployments | | | |
Scalability - Training, Serving | | | |
Integrated Development Environment | | | |
Experiment Tracking | | | |
Workflow Pipelines | | | |
Neural Architecture Search | | | |
Portability Across Infrastructure Providers | | | |
User Management | | | |
Managed Infrastructure and Auto Scaling | | | |
Model Management | | | |
Cloud Platform Integrations
Community
What’s Next?
Action Items