Yuan Tang (Akuity) and Andrey Velichkevich (Apple)
Managing Thousands of Automatic Machine Learning Experiments With Argo and Katib
Kubeflow Overview
What is Katib?
Katib Architecture
Challenges in HP Tuning
Argo Workflows
Memoization Cache
Step A
Cache Store
If the cache is outdated, re-run the step.
If the cache is still fresh (within a configurable time, e.g. under 10 seconds), retrieve the cache entry and use it.
Step B
Create the cache entry (if it does not exist yet).
The cache is saved as Kubernetes ConfigMaps (see the sketch below).
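A minimal sketch of how memoization can be declared on an Argo Workflows template (the cache key, ConfigMap name, and workflow parameter are illustrative): a memoize block specifies the cache key, the maxAge freshness window, and the ConfigMap used as the cache store.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: memoized-step-
spec:
  entrypoint: step-a
  arguments:
    parameters:
      - name: dataset-version
        value: "v1"
  templates:
    - name: step-a
      # Reuse the previous result if a fresh cache entry exists for this key.
      memoize:
        key: "step-a-{{workflow.parameters.dataset-version}}"  # illustrative cache key
        maxAge: "10s"              # re-run the step if the entry is older than this
        cache:
          configMap:
            name: step-a-cache     # entries are stored in this ConfigMap
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo 'step A result' > /tmp/out.txt"]
      outputs:
        parameters:
          - name: result
            valueFrom:
              path: /tmp/out.txt
```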
Memoization Cache - Example
Cache (K8s ConfigMap):
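An illustrative sketch of such a cache ConfigMap (the name, key, and the exact JSON schema of the entry are assumptions and vary by Argo Workflows version): each key is a memoization key, and the value records the node that produced the result, its outputs, and when the entry was created.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: step-a-cache
data:
  # One entry per memoization key.
  step-a-v1: |
    {"nodeID": "memoized-step-abc123",
     "outputs": {"parameters": [{"name": "result", "value": "step A result"}]},
     "creationTimestamp": "2022-05-18T10:15:00Z"}
```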
Memoization in ML Workflows
Data ingestion
Model training
Cache store
The data has NOT been updated recently.
The data has already been updated recently.
Triggers an ML pipeline with Argo Workflows
Triggers a Katib Experiment
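A minimal sketch of the Katib Experiment such a trigger could create (the search space, objective metric, container image, and training command below are illustrative):

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: model-tuning
  namespace: kubeflow
spec:
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate for the training job
        reference: lr
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/kubeflowkatib/mxnet-mnist:latest
                command:
                  - "python3"
                  - "/opt/mxnet-mnist/mnist.py"
                  - "--lr=${trialParameters.learningRate}"
            restartPolicy: Never
```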
Memoization Cache
Sequential steps:
Data ingestion step:
Distributed model training step:
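A sketch of an Argo Workflow wiring these two sequential steps together, with the data ingestion step memoized so it is skipped when a fresh cache entry exists (names, images, and the maxAge window are illustrative; a real distributed training step would typically submit a training job such as a PyTorchJob instead of running a plain container):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: main
  arguments:
    parameters:
      - name: dataset-version
        value: "v1"
  templates:
    - name: main
      steps:
        # Sequential: data ingestion completes before model training starts.
        - - name: data-ingestion
            template: data-ingestion
        - - name: model-training
            template: model-training
    - name: data-ingestion
      # Skipped (result reused) when a fresh cache entry exists for this dataset.
      memoize:
        key: "ingest-{{workflow.parameters.dataset-version}}"
        maxAge: "1h"
        cache:
          configMap:
            name: ingestion-cache
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo 'ingested dataset' > /tmp/dataset.txt"]
      outputs:
        parameters:
          - name: dataset
            valueFrom:
              path: /tmp/dataset.txt
    - name: model-training
      # Placeholder container; a real distributed training step would typically
      # create a training job (e.g. a PyTorchJob) via a resource template.
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo 'training model'"]
```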
Multi-objective Optimization Pipeline
[Pipeline diagram] Two data ingestion steps feed three parallel model training steps:
- Logistic Regression (Accuracy: 68%)
- Neural Networks (AUC: 76%)
- Decision Trees (Loss: 90%)
A metrics collection step gathers the results and a hyperparameters suggestion step proposes new values.
Triggers a new ML pipeline run with Argo Workflows using the newly suggested hyperparameters.
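One way to track several metrics in Katib is the Experiment objective section sketched below: Katib optimizes the objectiveMetricName, and its metrics collector also records the additionalMetricNames reported by each trial (the metric names are illustrative, matching the diagram above).

```yaml
# Objective section of a Katib Experiment (metric names are illustrative).
objective:
  type: maximize
  objectiveMetricName: accuracy
  additionalMetricNames:
    - auc
    - loss
```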
Multi-objective Optimization Pipeline
DAG:
Data ingestion steps:
Model training steps:
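A sketch of this DAG as an Argo Workflow (task names and images are illustrative; the echo containers stand in for real ingestion and training code): the two data ingestion tasks run first, and each model training task depends on both of them.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: multi-objective-pipeline-
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      dag:
        tasks:
          # Two data ingestion tasks run first.
          - name: ingest-train
            template: echo
            arguments:
              parameters: [{name: msg, value: "ingest training data"}]
          - name: ingest-eval
            template: echo
            arguments:
              parameters: [{name: msg, value: "ingest evaluation data"}]
          # Three model training tasks depend on both ingestion tasks.
          - name: logistic-regression
            dependencies: [ingest-train, ingest-eval]
            template: echo
            arguments:
              parameters: [{name: msg, value: "train logistic regression"}]
          - name: neural-network
            dependencies: [ingest-train, ingest-eval]
            template: echo
            arguments:
              parameters: [{name: msg, value: "train neural network"}]
          - name: decision-tree
            dependencies: [ingest-train, ingest-eval]
            template: echo
            arguments:
              parameters: [{name: msg, value: "train decision tree"}]
    - name: echo
      inputs:
        parameters:
          - name: msg
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo {{inputs.parameters.msg}}"]
```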
Demo: Caching
Join the Argo Workflows and Katib Communities
Thank you!