1 of 26

The Machine Learning Toolkit for Kubernetes

Krishna Durai, Cisco Systems, Bangalore

2 of 26

Impact of Machine Learning

3 of 26

Makes it Easy for Everyone to Develop, Deploy and Manage a Portable, Distributed and Scalable ML system on Kubernetes

4 of 26

Agenda

  • ML Lifecycle
  • Kubeflow - Architecture and Features
  • Demo
  • Alternatives to Kubeflow and Comparison
  • Kubeflow Community

5 of 26

ML Lifecycle: Perception vs. Reality

Source: KubeCon 2018 (Budapest) Talk: Building ML Products with Kubeflow

6 of 26

Building ML Products

Building a Model

Logging

Data Ingestion

Data Analysis

Data Transformation

Data Validation

Data Splitting

Trainer

Model Validation

Training At Scale

Roll-out

Serving

Monitoring

Source: Kubeflow Contributor Summit 2019 talk: Kubeflow and ML Landscape

7 of 26

Makes it Easy for Everyone to Develop, Deploy and Manage a Portable, Distributed and Scalable ML system on Kubernetes

8 of 26

Kubeflow runs on Kubernetes

General Applications

Infrastructure (Cloud/On-Prem)

Container Runtimes

ML Workloads (Modelling, training, roll-out, serving, ...)

Infrastructure (Cloud/On-Prem)

Container Runtimes

9 of 26

Kubeflow Current Features

Building a Model

Logging

Data Ingestion

Data Analysis

Data Transformation

Data Validation

Data Splitting

Trainer

Model Validation

Training At Scale

Roll-out

Serving

Monitoring

Source: Kubeflow Contributor Summit 2019 talk: Kubeflow and ML Landscape

10 of 26

Kubeflow Architecture

Make it easy to deploy and administer a platform

  • Leverage Kubeflow-native and non-Kubeflow components

Tie it together using

  • Orchestration
  • Metadata

Libraries and CLIs

Higher Level Services

Low Level APIs / Services

Arena

kfctl

kubectl

Katib

Pipelines

Notebooks

Fairing

TFJob

PyTorchJob

Jupyter CR

Seldon CR

Kubebench

Metadata

Orchestration

Pipelines CR

Argo

Study Job

MPIJob

Spark Job

Model DB

TFX

Developed By Kubeflow

Developed Outside Kubeflow

Adapted from Kubeflow Contributor Summit 2019 talk: Kubeflow and ML Landscape (Not all components are shown)

11 of 26

Job Operators (TF/PyTorch/MPI/...)

Provides a set of K8s Custom Resources that turn the distributed-training concepts of ML frameworks into K8s resources

Makes it easy to configure and run local/distributed training jobs of various ML frameworks on K8s

TFJob Operator
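
To make the custom-resource idea concrete, here is a minimal sketch that builds a TFJob manifest as a plain Python dict and prints it as JSON. The `apiVersion` value, the job name `mnist-train` and the image `example.com/mnist-train:latest` are assumptions for illustration, not values from this talk; check the API version your cluster's operator actually serves.

```python
import json


def build_tfjob_manifest(name, image, workers=2, parameter_servers=1):
    """Return a TFJob custom resource as a plain dict.

    The apiVersion and the image are illustrative assumptions.
    """

    def replica(count):
        # One replica group (e.g. all workers) with a shared pod template.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        # The TFJob operator expects the training container
                        # to be named "tensorflow".
                        {"name": "tensorflow", "image": image}
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",  # assumed API version
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "PS": replica(parameter_servers),
                "Worker": replica(workers),
            }
        },
    }


# Hypothetical job name and image, for illustration only.
manifest = build_tfjob_manifest("mnist-train", "example.com/mnist-train:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since kubectl accepts JSON as well as YAML. The distributed roles (parameter servers, workers) appear as named replica groups, which is the mapping from framework concepts to K8s resources described above.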

12 of 26

Notebooks (and beyond)

Orchestration for Notebooks

  • Create and manage multiple Notebook servers in one place

Integration with advanced DSLs

  • Kubeflow Fairing: build, train and serve, all from Notebooks
  • Kubeflow Pipeline SDK: create and deploy workflows from Notebooks

13 of 26

Pipelines

Combine individual tasks into end-to-end workflows

Provide orchestration and service integration

Enable components & sharing

Help with job tracking, experimentation, monitoring
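
The idea of combining individual tasks into an end-to-end workflow can be sketched in plain Python. This is a conceptual illustration only and does not use the Kubeflow Pipelines SDK: the task names, the `run_pipeline` helper and the toy data are all hypothetical. Each task is a function, dependencies form a directed acyclic graph, and tasks run in dependency order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Each task receives a dict with the results of its upstream tasks.
def prepare_data(inputs):
    return ["hello", "book an appointment"]


def train(inputs):
    return {"model": "trained", "examples": len(inputs["prepare_data"])}


def evaluate(inputs):
    return {"passed": inputs["train"]["examples"] > 0}


def deploy(inputs):
    return "deployed" if inputs["evaluate"]["passed"] else "skipped"


TASKS = {
    "prepare_data": prepare_data,
    "train": train,
    "evaluate": evaluate,
    "deploy": deploy,
}

# Maps each task to the set of tasks it depends on.
DEPENDENCIES = {
    "prepare_data": set(),
    "train": {"prepare_data"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}


def run_pipeline(tasks, dependencies):
    """Run every task after its dependencies and collect the results."""
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: results[dep] for dep in dependencies[name]}
        results[name] = tasks[name](upstream)
    return results


results = run_pipeline(TASKS, DEPENDENCIES)
print(results["deploy"])
```

In a real pipeline each task would run as its own container and the orchestrator would pass artifacts between steps; the sketch keeps everything in one process to show only the dependency ordering.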

14 of 26

Katib - AutoML and Hyperparameter Tuning

Provides a service that automates optimization of model hyperparameters and neural network architectures

Makes it easy to track, view and manage experiment results for hyperparameter candidates

Growing set of suggestion algorithms

    • Grid search
    • Bayesian Optimization
    • Reinforcement Learning
    • ...

Zhou, Jinan, et al. "Katib: A Distributed General AutoML Platform on Kubernetes." USENIX OpML '19, 2019.
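
As an illustration of the simplest suggestion algorithm listed on this slide, grid search, here is a minimal sketch in plain Python. It does not use Katib's API: the search space, the `toy_objective` function (a stand-in for a real training run) and the `grid_search` helper are all hypothetical.

```python
from itertools import product

# Hypothetical search space: every combination becomes one trial.
SEARCH_SPACE = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
}


def toy_objective(learning_rate, batch_size):
    """Stand-in for a training run; best at learning_rate=0.01, batch_size=64."""
    return -abs(learning_rate - 0.01) - abs(batch_size - 64) / 1000


def grid_search(space, objective):
    """Evaluate every combination in the space and return the best trial."""
    names = list(space)
    trials = []
    for values in product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        trials.append({"params": params, "score": objective(**params)})
    best = max(trials, key=lambda trial: trial["score"])
    return best, trials


best, trials = grid_search(SEARCH_SPACE, toy_objective)
print(len(trials), best["params"])
```

A tuning service adds what this sketch leaves out: running each trial as a separate job on the cluster, recording results, and offering smarter suggestion strategies than exhaustive enumeration.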

15 of 26

Kubebench

A harness for benchmarking ML workloads on Kubernetes

Builds pipelines for and automates common benchmarking tasks (configuration, deployment, result collection, ...)

Evaluate performance of ML models as well as the system stack

Enable deployment of state-of-the-art benchmarking workloads on K8s

16 of 26

TensorFlow MNIST Handwritten Digits Detection with Hyperparameter Tuning

Demo

17 of 26

Use Case: Conversational Chatbot

Goal: Train, evaluate, and test an AI-powered chatbot for suggesting appointments

The Rasa Stack is a set of open source machine learning tools for developers to create contextual AI assistants and chatbots:

  • NLU model understands the user’s message (intents, entities, etc.) based on your training data.
  • Core model decides what happens next in this conversation.
  • Its machine learning-based dialogue management predicts the next best action based on the input from NLU, the conversation history, and the training data.

18 of 26

Use Case: Initial Solution

Rasa Model Training Pipeline

TensorFlow Serving

Trains Rasa NLU and Rasa Core models

TF trained models’ metadata

Apache Beam Data Processing

Rasa JSON input format

Training Data

Prepares Training Data for Rasa

Deployed on AWS EC2 machine

19 of 26

Use Case: Kubeflow Solution

  • Define a Kubeflow Pipeline using the Pipelines Python SDK
  • Compile and upload the pipeline package to Kubeflow Pipelines dashboard
  • Configure and run the pipeline to train and evaluate the model end-to-end
  • Run the trained model with a simple deployed application

deploy-model

20 of 26

Comparing Kubeflow with Similar Frameworks

Features compared across Kubeflow, MLflow and H2O.ai:

  • Model Training, HP Tuning, Serving Deployments
  • Scalability - Training, Serving
  • Integrated Development Environment
  • Experiment Tracking
  • Workflow Pipelines
  • Neural Architecture Search
  • Portability Across Infrastructure Providers
  • User Management
  • Managed Infrastructure and Auto Scaling
  • Model Management

21 of 26

Cloud Platform Integrations

22 of 26

Community

23 of 26

What’s Next?

  • Jupyter Notebook
    • Profiles
    • Isolation
  • Katib: Neural Architecture Search
  • Fairing
  • On-premise usage
  • Experiment Tracking with Metadata Schema
  • Model Management

24 of 26

Agenda

  • ML Lifecycle
  • Kubeflow - Architecture and Features
  • Demo
  • Alternatives to Kubeflow and Comparison
  • Kubeflow Community

25 of 26

Makes it Easy for Everyone to Develop, Deploy and Manage a Portable, Distributed and Scalable ML system on Kubernetes

26 of 26

Action Items
