1 of 46

Contributors:

thealamkin@google.com | @taylamkin

josh.bottum@canonical.com

david.aronchick@microsoft.com | @aronchick

carmine.rimi@canonical.com | @carminerimi

2019-03-12

Kubeflow

Contributors Summit

Product Manager Update

2 of 46

Agenda

Building the Kubeflow Ecosystem
Kubeflow User Survey
Collecting User Input for Releases
The Release Schedule

3 of 46

David Aronchick

david.aronchick@microsoft.com | @aronchick

Building the Kubeflow Ecosystem

4 of 46

KubeCon 2017

5 of 46

The Problem

Setting up an ML stack/pipeline is incredibly hard
Setting up a production ML stack/pipeline is even hardER
Setting up an ML stack/pipeline that ports between the 81% of enterprises that use multi-cloud* environments is EVEN HARDER.

* Note: For the purposes of today, “local” is a specific type of “multi-cloud”

6 of 46

Kubeflow Contributor Summit 2018

7 of 46

Kubeflow Contributor Summit 2018

8 of 46

Mission (2017)

9 of 46

Make it Easy for Everyone

to Develop, Deploy and Manage Portable, Distributed ML on Kubernetes

10 of 46

Mission (2018)

11 of 46

Make it Easy for Everyone

to Develop, Deploy and Manage Portable, Distributed ML on Kubernetes

12 of 46

Mission (2019)

13 of 46

Make it Easy for Everyone

to Develop, Deploy and Manage Portable, Distributed ML on Kubernetes

14 of 46

Kubeflow

15 of 46

Summary

Kubeflow = Cloud Native, multi-cloud solution for ML.
Kubeflow provides a platform for composable, portable and scalable ML pipelines.
If you have a Kubernetes-conformant cluster, you can run Kubeflow.

16 of 46

Cloud

Training

Experimentation

17 of 46

Critical User Journey Comparison

2017

Experiment with Jupyter
Distribute your training with TFJob
Serve your model with Seldon

2019

Setup locally with miniKF
Access your cluster with Istio/Ingress
Ingest your data with Pachyderm
Transform your data with TF.T
Analyze the data with TF.DV
Experiment with Jupyter
Hyperparam sweep with Katib
Distribute your training with TFJob
Analyze your model with TF.MA
Serve your model with Seldon
Orchestrate everything with KF.Pipelines

18 of 46

Momentum!

~4000 commits
~200 community contributors
~50 companies contributing, including:

19 of 46

Community Contributions

NOT GOOGLE

GOOGLE

Kubernetes

Kubeflow

NOT GOOGLE

GOOGLE

20 of 46

Community Contribution: Katib from NTT

Pluggable microservice architecture for HP tuning

Different optimization algorithms
Different frameworks

StudyJob (K8s CRD)

Hides complexity from user
No code needed to do HP tuning

HP Tuning is critical to training high quality models.

With Katib, we are creating a K8s native implementation of a hyperparameter tuning system

* Implemented as a set of containerized microservices

* Depend on K8s to deploy and manage the microservices

* Using K8s custom controllers to implement the control loop

The StudyJob controller makes it easy for users to declaratively specify the optimization space, the algorithm they want to use, and how they want to train their model.

* Users can use any K8s resource (including Kubeflow ML specific resources like TFJob or PyTorch job) to train their model.

Algorithms are implemented as microservices, so users can select one of the canned algorithms or implement their own.

Finally, the user needs a database to keep track of models and other data needed for the search.

The end result is a system consisting of multiple components that need to be deployed and wired together. Making this easy for users depends on the work Kubeflow is doing to create a simple story for deploying and managing the platform.
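
The declarative idea in these notes (a search space, a pluggable algorithm, a training step, and a store for results) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it does not use Katib or the StudyJob schema, and `SEARCH_SPACE`, `suggest_random`, `train_and_score`, and `run_study` are hypothetical names invented for the sketch. Random search stands in for whichever algorithm a user would select.

```python
import random

# Hypothetical search space; names and ranges are illustrative, not Katib's schema.
SEARCH_SPACE = {
    "learning_rate": ("double", 0.001, 0.1),
    "num_layers": ("int", 1, 4),
}


def suggest_random(space, rng):
    """Pluggable algorithm: draw one trial's parameters uniformly from the space."""
    params = {}
    for name, (kind, low, high) in space.items():
        if kind == "int":
            params[name] = rng.randint(low, high)
        else:
            params[name] = rng.uniform(low, high)
    return params


def train_and_score(params):
    """Stand-in for a real training job; returns a made-up loss to minimize."""
    return (params["learning_rate"] - 0.01) ** 2 + 0.001 * params["num_layers"]


def run_study(space, suggest, objective, num_trials, seed=0):
    """Control loop: ask for parameters, run a trial, record it, keep the best."""
    rng = random.Random(seed)
    trials = []  # in-memory stand-in for the database that tracks the search
    for _ in range(num_trials):
        params = suggest(space, rng)
        trials.append((objective(params), params))
    return min(trials, key=lambda trial: trial[0])


best_loss, best_params = run_study(
    SEARCH_SPACE, suggest_random, train_and_score, num_trials=20
)
print(best_loss, best_params)
```

Swapping `suggest_random` for another function with the same signature changes the algorithm without touching the loop, which is the property the microservice design is after.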

21 of 46

Community Contribution: TensorRT from NVIDIA

Production datacenter inferencing server
Maximize real-time inference performance of GPUs

Multiple models per GPU per node
Supports heterogeneous GPUs & multi GPU nodes

Integrates with orchestration systems and auto scalers via latency and health metrics
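
As a rough illustration of how a latency metric can drive an autoscaler, the sketch below applies the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current * observed / target), to a latency reading. It is not taken from the inference server or from any Kubeflow component: the function name, the millisecond units, and the `max_replicas` cap are assumptions for the example, and the real autoscaler also applies a tolerance band and readiness checks that are left out here.

```python
import math


def desired_replicas(current_replicas, observed_latency_ms, target_latency_ms,
                     max_replicas=10):
    """Scale the replica count in proportion to observed vs. target latency."""
    ratio = observed_latency_ms / target_latency_ms
    proposed = math.ceil(current_replicas * ratio)
    # Clamp to at least one replica and at most the assumed cap.
    return max(1, min(max_replicas, proposed))


# Latency is 1.5x the target with 2 replicas, so the rule asks for 3.
print(desired_replicas(2, 150.0, 100.0))
```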

22 of 46

Community Contribution: Argo from Intuit

Argo CRD for workflows
Argo CRD is the engine for Pipelines (more on that later)
Argo CD for GitOps
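
What a workflow engine does at its core, running steps once their dependencies have finished, can be sketched with the Python standard library (`graphlib`, Python 3.9+). This is not Argo's API or the Pipelines DSL. The step names loosely follow the 2019 user journey, and the dependencies between them are assumed for illustration.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (assumed for this sketch).
STEPS = {
    "ingest_data": set(),
    "transform_data": {"ingest_data"},
    "analyze_data": {"ingest_data"},
    "train_model": {"transform_data", "analyze_data"},
    "analyze_model": {"train_model"},
    "serve_model": {"analyze_model"},
}


def execution_order(steps):
    """Return one valid order in which every step runs after its dependencies."""
    return list(TopologicalSorter(steps).static_order())


order = execution_order(STEPS)
print(order)
```

Steps with no dependency between them, such as `transform_data` and `analyze_data` here, may appear in either order, and a real engine could run them in parallel.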

23 of 46

Community Contribution: Notebooks & Storage from Arrikto

Core Notebook Experience

0.4: New JupyterHub-based UI

Multiple Persistent Volumes

0.5: K8s-Native Notebooks UI

Pipelines: Support for local storage

MiniKF: All-in-one packaging for seamless local deployments

24 of 46

But We STILL Need Help!

25 of 46

OK! But WHAT Do We Need…?

26 of 46

Kubeflow

User Survey Update

Contributors:

josh.bottum@canonical.com, thealamkin@google.com

2019-03-12

27 of 46

28 of 46

29 of 46

Hybrid

VMware or OpenStack

30 of 46

Iteration/Tracking Experiments

31 of 46

Simplified end-to-end workflows

Automatic Hyperparameter tuning

32 of 46

Roadmap Directions

Simplified 1st use

through landing page and dashboard

Simplified end-to-end workflows

with integrated build, train and deploy

Enterprise readiness

Authentication, Isolation, multi-user

Plus your input!

33 of 46

Collecting User Input for Upcoming Releases

Contributors:

josh.bottum@canonical.com, thealamkin@google.com

2019-03-12

34 of 46

Process to collect user input on Roadmap items (CUJs)

Goals

Develop a simplified process to collect user input on roadmap items
Build end-user energy & ownership of new features

Need to Balance

Which Personas
Consistent input vs. new views
Easy CUJ creation & Easy collection of user feedback

Proposed Process

http://bit.ly/2TsPYfg
Major releases may require extra data collection (Kubeflow 1.0)

35 of 46

CUJ Format

Kubeflow Release Mgr PM will work to identify ~3 topics for data collection

Kubeflow Engineering Leads will develop CUJs and feature descriptions

Target - Week 3 of 12

Kubeflow Engineering Leads will develop 1-2 questions for the data-gathering sessions to answer (prior to the meeting)

Kubeflow Engineers will identify the preferred personas for the questions

36 of 46

CUJ Delivery

CUJs / Feature Pack Reviews
1 to 3 Reviews in each format

in a Group format
in a 1:1 format

Reviews should be 20-30 minutes.

Kubeflow Outreach PM Responsibilities

Set up meetings

Target: 50% of input from repeat end-users

Take notes

Validate input (before closing the call)
Store in Google Docs & post edited notes

37 of 46

Outreach & Local Events

Contributors:

josh.bottum@canonical.com, thealamkin@google.com

2019-03-12

38 of 46

Kubeflow Days

Goals

Increase Kubeflow adoption
Simplify on-site updates to top Communities
Develop local end-user presenters and use cases

Kubeflow Day LA (at SoCal Linux Expo - March 2019)

11 vendor sessions

Google, Microsoft, Cisco, Arrikto, Red Hat, Canonical, MavenCodes

350+ Paid Registrations, 120+ attendees

Ticketmaster, Disney, DreamWorks, Sony, Honey, and more

Attendee Feedback

Core contributor sessions received more feedback / appreciation

39 of 46

Carmine Rimi

carmine.rimi@canonical.com | @carminerimi

The Release Cycle

40 of 46

Critical User Journeys

User: One or more Target Personas
CUJs:

GitHub KF/KF CUJ Labels

CUJ Beginner Guide

Target Personas:

Data Scientist
ML Engineer 
Data Engineer 
Application Engineer 
Platform Engineer 
Infrastructure Engineer
Operations Engineer
Automation Engineer
Manager

41 of 46

CUJ Process(es)

Identify “Critical” Persona / Feature / Experience Gap
Write Draft CUJ
Circulate, Consolidate Feedback, Finalize

Feedback from Target Persona {before | during | after}¹⁺

Create Label + Issues
Possible Roadmap Assignment
Finalize CUJ (add issue query)

Identify → Draft → “Final” → Label & Issues → Roadmap → Final

42 of 46

Release Cycles

Phase 0

Themes, CUJs, Debt
Kubeflow Workstreams - Coordinators

Phase 1

Scope, Resources, Quality
(time already established)

Phase 2

Execution, Tracking
Releasing ..

Release N-1

Release N

Release N+1

3+ months

43 of 46

Integrating

44 of 46

Integrating

CI (Build, E2E Tests)
Deployment
Lifecycle
Pipeline
Docs
Certification? (Portability)

Code
Examples

45 of 46

Thank you

Carmine Rimi

carmine.rimi@canonical.com | @carminerimi

46 of 46

Kubeflow

Contributor Summit Product Manager Update

End