1 of 26

Kubeflow: The Machine Learning Toolkit for Kubernetes

Krishna Durai, Cisco Systems, Bangalore

2 of 26

Impact of Machine Learning

3 of 26

Kubeflow makes it easy for everyone to develop, deploy, and manage a portable, distributed, and scalable ML system on Kubernetes

4 of 26

Agenda

ML Lifecycle
Kubeflow - Architecture and Features
Demo
Alternatives to Kubeflow and Comparison
Kubeflow Community

5 of 26

ML Lifecycle: Perception vs. Reality

Source: KubeCon 2018 (Budapest) Talk: Building ML Products with Kubeflow

6 of 26

Building ML Products

Building a Model

Logging

Data Ingestion

Data Analysis

Data Transformation

Data Validation

Data Splitting

Trainer

Model Validation

Training At Scale

Roll-out

Serving

Monitoring

Source: Kubeflow Contributor Summit 2019 talk: Kubeflow and ML Landscape

7 of 26

Kubeflow makes it easy for everyone to develop, deploy, and manage a portable, distributed, and scalable ML system on Kubernetes

8 of 26

Kubeflow runs on Kubernetes

General Applications

Infrastructure (Cloud/On-Prem)

Container Runtimes

ML Workloads (Modelling, training, roll-out, serving, ...)

Infrastructure (Cloud/On-Prem)

Container Runtimes

9 of 26

Kubeflow Current Features

Building a Model

Logging

Data Ingestion

Data Analysis

Data Transformation

Data Validation

Data Splitting

Trainer

Model Validation

Training At Scale

Roll-out

Serving

Monitoring

Source: Kubeflow Contributor Summit 2019 talk: Kubeflow and ML Landscape

10 of 26

Kubeflow Architecture

Make it easy to deploy and administer a platform

Leverage Kubeflow-native and non-Kubeflow components

Tie it together using

Orchestration
Metadata

Libraries and CLIs

Higher Level Services

Low Level APIs / Services

Arena

kfctl

kubectl

Katib

Pipelines

Notebooks

Fairing

TFJob

PyTorchJob

Jupyter CR

Seldon CR

Kubebench

Metadata

Orchestration

Pipelines CR

Argo

Study Job

MPIJob

Spark Job

Model DB

TFX

Developed By Kubeflow

Developed Outside Kubeflow

Adapted from Kubeflow Contributor Summit 2019 talk: Kubeflow and ML Landscape (Not all components are shown)

11 of 26

Job Operators (TF/PyTorch/MPI/...)

Provide a set of K8s Custom Resources that turn distributed concepts in ML frameworks into K8s resources

Make it easy to configure and run local or distributed training jobs for various ML frameworks on K8s

TFJob Operator
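
As a rough illustration of how a distributed-training concept maps onto a Kubernetes resource, the sketch below builds a TFJob-style manifest as a plain Python dictionary. The `kubeflow.org/v1` API version, the job name, the image, and the replica counts are assumptions made for illustration; the field names should be checked against the TFJob version installed in the cluster.

```python
import json


def tfjob_manifest(name, image, workers=2, ps=1):
    """Build a TFJob-style manifest as a plain dict (illustrative only)."""

    def replica(count):
        # Each replica type carries its own replica count and pod template.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        # The training container is conventionally named "tensorflow".
                        {"name": "tensorflow", "image": image}
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",  # assumed; older releases used beta versions
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": replica(workers),
                "PS": replica(ps),
            }
        },
    }


# Hypothetical job name and image, used only to show the resulting shape.
manifest = tfjob_manifest("mnist-train", "example.com/mnist:latest")
print(json.dumps(manifest, indent=2))
```

The worker and parameter-server roles of the framework become entries under `tfReplicaSpecs`, so the operator, rather than the user, is responsible for creating and wiring the pods. The printed JSON could be saved to a file and submitted with `kubectl apply -f`.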

12 of 26

Notebooks (and beyond)

Orchestration for Notebooks

Create and manage multiple Notebook servers in one place

Integration with advanced DSLs

Kubeflow Fairing: build, train and serve, all from Notebooks
Kubeflow Pipeline SDK: create and deploy workflows from Notebooks

13 of 26

Pipelines

Combine individual tasks into end-to-end workflows

Provide orchestration and service integration

Enable components & sharing

Help with job tracking, experimentation, monitoring
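
The idea of combining individual tasks into an end-to-end workflow can be sketched with the Python standard library alone. This is not the Kubeflow Pipelines SDK; the task names and the `run_workflow` helper are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each maps to the set of tasks it depends on.
workflow = {
    "ingest": set(),
    "validate": {"ingest"},
    "transform": {"validate"},
    "train": {"transform"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}


def run_workflow(dag, actions):
    """Run every task after all of its dependencies and return the run order."""
    order = list(TopologicalSorter(dag).static_order())
    results = {}
    for task in order:
        # Hand each task the results of its direct dependencies.
        upstream = {dep: results[dep] for dep in dag[task]}
        results[task] = actions[task](upstream)
    return order, results


# Stand-in actions that simply return a status string.
actions = {name: (lambda upstream, n=name: f"{n} done") for name in workflow}
order, results = run_workflow(workflow, actions)
print(order)
```

In a real pipeline each task would run as a container and the orchestrator would track its status and outputs; the dependency-ordered execution is the part this sketch tries to capture.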

14 of 26

Katib - AutoML and Hyperparameter Tuning

Provides a service that automates optimization of model hyperparameters and neural network architectures

Makes it easy to track, view and manage experiment results for hyperparameter candidates

Growing list of suggestion algorithms

Grid search
Bayesian Optimization
Reinforcement Learning
...
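
To make the simplest of these suggestion algorithms concrete, here is a minimal grid-search sketch in plain Python. It does not use Katib's API; the `grid_search` helper, the toy objective, and the search space are assumptions chosen only to show how candidates are enumerated and compared.

```python
import itertools


def grid_search(objective, search_space):
    """Evaluate every combination in the grid and return the best trial."""
    names = sorted(search_space)
    trials = []
    for values in itertools.product(*(search_space[n] for n in names)):
        params = dict(zip(names, values))
        trials.append((objective(params), params))
    # Lower objective values are treated as better in this sketch.
    best_score, best_params = min(trials, key=lambda t: t[0])
    return best_score, best_params, trials


def toy_objective(params):
    # Hypothetical stand-in for a validation loss, minimal at
    # learning_rate=0.01 and batch_size=64.
    return abs(params["learning_rate"] - 0.01) + abs(params["batch_size"] - 64) / 1000


space = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}
best_score, best_params, trials = grid_search(toy_objective, space)
print(best_params, len(trials))
```

Each evaluated combination corresponds to one trial; a tuning service would run such trials as separate training jobs and record their metrics instead of calling a local function.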

Zhou, Jinan, et al. "Katib: A Distributed General AutoML Platform on Kubernetes." USENIX OpML '19, 2019.

15 of 26

Kubebench

A harness for benchmarking ML workloads on Kubernetes

Automates common benchmarking tasks as a pipeline (configuration, deployment, result collection, ...)

Evaluates the performance of ML models as well as the system stack

Enables deployment of state-of-the-art benchmarking workloads on K8s
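
The configure, run, and collect-results loop that a benchmarking harness automates can be sketched as follows. This is not Kubebench code; the `benchmark` helper and the toy workload are hypothetical stand-ins, and a real harness would launch jobs on the cluster rather than call a local function.

```python
import statistics
import time


def benchmark(workload, configs, repeats=3):
    """Run a workload under each configuration and collect timing results."""
    report = []
    for config in configs:
        durations = []
        for _ in range(repeats):
            start = time.perf_counter()
            workload(**config)
            durations.append(time.perf_counter() - start)
        report.append({
            "config": config,
            "mean_seconds": statistics.mean(durations),
            "min_seconds": min(durations),
        })
    return report


def toy_workload(size):
    # Hypothetical stand-in for a training or inference job.
    return sum(i * i for i in range(size))


report = benchmark(toy_workload, [{"size": 1_000}, {"size": 100_000}])
for row in report:
    print(row["config"], f"{row['mean_seconds']:.6f}s")
```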

16 of 26

Demo

TensorFlow MNIST Handwritten Digits Detection with Hyperparameter Tuning

Available at: https://github.com/CiscoAI/KFLab/tree/master/pipelines/tf-mnist

17 of 26

Use Case: Conversational Chatbot

Goal: Train, evaluate, and test an AI-powered chatbot for suggesting appointments

The Rasa Stack is a set of open source machine learning tools for developers to create contextual AI assistants and chatbots:

NLU model understands the user’s message (intents, entities, etc.) based on your training data.
Core model decides what happens next in this conversation.
Its machine learning-based dialogue management predicts the next best action based on the input from NLU, the conversation history, and the training data.

rasa.com

18 of 26

Use Case: Initial Solution

Apache Beam Data Processing: prepares training data for Rasa

Training Data (Rasa JSON input format)

Rasa Model Training Pipeline: trains Rasa NLU and Rasa Core models

TF trained models' metadata

TensorFlow Serving

Deployed on AWS EC2 machine

19 of 26

Use Case: Kubeflow Solution

Define a Kubeflow Pipeline using the Pipelines Python SDK
Compile and upload the pipeline package to the Kubeflow Pipelines dashboard
Configure and run the pipeline to train and evaluate the model end-to-end
Run the trained model with a simple deployed application

deploy-model

20 of 26

Comparing Kubeflow with Similar Frameworks

Features	Kubeflow	MLflow	H2O.ai
Model Training, HP Tuning, Serving Deployments
Scalability - Training, Serving
Integrated Development Environment
Experiment Tracking
Workflow Pipelines
Neural Architecture Search
Portability Across Infrastructure Providers
User Management
Managed Infrastructure and Auto Scaling
Model Management

21 of 26

Cloud Platform Integrations

22 of 26

Community

23 of 26

What’s Next?

Jupyter Notebook

Profiles
Isolation

Katib: Neural Architecture Search
Fairing
On-premise usage
Experiment Tracking with Metadata Schema
Model Management

24 of 26

Agenda

ML Lifecycle
Kubeflow - Architecture and Features
Demo
Alternatives to Kubeflow and Comparison
Kubeflow Community

25 of 26

Kubeflow makes it easy for everyone to develop, deploy, and manage a portable, distributed, and scalable ML system on Kubernetes

26 of 26

Action Items

Visit www.kubeflow.org
Visit https://github.com/CiscoAI/KFLab/tree/master/pipelines/tf-mnist
Contribute to the community