1 of 22

Tommy Li, Senior Software Developer, IBM

The Complexity of Scaling ML Pipelines on Kubernetes Using Tekton

2 of 22

Tekton Pipeline

3 of 22

OpenShift Pipelines

  • OpenShift Container Platform is the industry’s leading enterprise Kubernetes platform, and it brings many features for developers out of the box, including CI/CD capabilities.
  • OpenShift Pipelines is based on the Tekton project and offers native integration with the OpenShift platform to provide a smooth experience for developers.
  • Certified by Red Hat for OpenShift
  • Enterprise version available on OpenShift

4 of 22

What Tekton provides out of the box

  • Run workflows
  • Create a new pod for each task
  • String-matching conditions (when expressions)
  • Parameter passing
  • API that connects to custom controllers
  • Ability to optimize workflows at the controller level and provide abstracted templates
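As a sketch, the items above map onto a Tekton Pipeline spec roughly as follows; the pipeline and task names here are hypothetical:

```yaml
# Hypothetical pipeline illustrating parameter passing and a
# string-matching `when` condition; each task runs in its own pod.
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: train-and-deploy
spec:
  params:
    - name: model-name
      type: string
  tasks:
    - name: train                      # scheduled as its own pod
      taskRef:
        name: train-task               # assumed to exist in the cluster
      params:
        - name: model-name
          value: $(params.model-name)  # parameter passing
    - name: deploy
      runAfter: ["train"]
      when:                            # string-matching condition
        - input: $(tasks.train.results.status)
          operator: in
          values: ["passed"]
      taskRef:
        name: deploy-task
```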

5 of 22

Useful Tekton Features that Empower ML Pipelines

  • Tekton Finally
  • Tekton API standard and Spec abstraction
  • Tekton workspaces
  • Tekton custom tasks
  • Tekton termination logic
  • Tekton matrix and looping
  • CEL condition expression
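A few of these features combined in one hedged sketch (task and workspace names are invented): `workspaces` share storage between tasks, a `matrix` fans a task out over parameter values, and `finally` runs cleanup after all other tasks complete. CEL expressions and custom tasks run through separate controllers and are omitted here.

```yaml
# Hypothetical pipeline fragment combining workspaces, matrix
# fan-out, and a `finally` cleanup task (matrix is gated behind
# alpha/beta feature flags in some Tekton releases).
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: feature-demo
spec:
  workspaces:
    - name: shared-data                # shared storage across tasks
  tasks:
    - name: train
      matrix:                          # one TaskRun per value
        params:
          - name: learning-rate
            value: ["0.1", "0.01", "0.001"]
      taskRef:
        name: train-task               # hypothetical task
      workspaces:
        - name: data
          workspace: shared-data
  finally:                             # always runs, even on failure
    - name: cleanup
      taskRef:
        name: cleanup-task
```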

6 of 22

Tekton High level flow

7 of 22

Tekton Limitations for ML Workflows

  • No Caching out of the box
  • Limited capability from the Python SDK
  • No garbage collection policy out of the box
  • No log archival out of the box
  • No out of the box parameter sharing capability across pipelines
  • Limited scalability because the whole pipeline status is stored in etcd
  • Difficult to track metadata lineage

8 of 22

Kubeflow Pipelines

9 of 22

Initial Optimization leveraging Kubeflow Pipelines

10 of 22

What Kubeflow Pipelines on Tekton V1 provides

  • Tekton pipeline garbage collection (reduces Kubernetes etcd disk usage)
  • Python DSL and API server to optimize queries to the Kubernetes API
  • Caching using pod mutations (no impact on the Tekton controller logic)
  • Advanced conditions using the custom task controller
  • Pipeline loop/sub-pipeline logic using the custom task controller
  • Log archival and common storage setup
  • Easy-to-plug-in helper functions for custom usage, such as waiting for files
  • Preliminary metadata tracking
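The loop logic above is delivered as a Tekton custom task. Below is a hedged sketch assuming KFP-Tekton's `PipelineLoop` CRD; exact field names may differ between releases, and the inner task names are hypothetical:

```yaml
# Hedged sketch of KFP-Tekton's PipelineLoop custom task: the
# controller expands one iteration of the inner pipeline per value
# of the iteration parameter.
apiVersion: custom.tekton.dev/v1alpha1
kind: PipelineLoop
metadata:
  name: sweep-learning-rate
spec:
  iterateParam: learning-rate          # parameter looped over
  pipelineSpec:                        # inner pipeline, run per item
    params:
      - name: learning-rate
        type: string
    tasks:
      - name: train
        taskRef:
          name: train-task             # hypothetical task
        params:
          - name: learning-rate
            value: $(params.learning-rate)
```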

11 of 22

Limitation with KFP-Tekton V1

12 of 22

Moving to Kubeflow Pipelines V2

13 of 22

What is Smart Runtime

14 of 22

Abstraction Layer Features

15 of 22

Kubeflow Pipelines V2 Design Charts: Drivers

16 of 22

Kubeflow Pipelines V2 Design Charts: Publishers

17 of 22

Kubeflow Pipelines on Tekton 2.0 High Level Flow

Custom Task Controller: New Implementation

Handle Root DAG, sub-DAG, and CONTAINER drivers as well as DAG Publisher

  • Reuse the driver code under v2/driver to implement the custom task controller and connect it with the Tekton taskrun reconcile logic using the Tekton package directly. This removes the extra task CR from each KFP Task.

  • Condition and parameter pulling/storing are done as part of the driver and publisher code. We can add CEL package support for special expressions without the need for another when expression.
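Concretely, with this implementation a KFP step in the Tekton pipeline can be expressed as a custom task reference that the controller reconciles directly, instead of materializing an extra CR per task. The API group, kind, and result names below are illustrative placeholders, not the project's actual names:

```yaml
# Illustrative only: a pipeline task delegated to a KFP custom task
# controller, which runs the driver/publisher logic inside its
# reconcile loop. apiVersion/kind are placeholders, not confirmed.
    - name: train
      taskRef:
        apiVersion: kfp.tekton.dev/v1alpha1    # placeholder group
        kind: KFPTask                          # placeholder kind
        name: train-driver
      params:
        - name: parent-dag-id
          value: $(tasks.root-dag.results.execution-id)  # hypothetical
```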

18 of 22

Kubeflow Pipelines on Tekton V2 brings

  • Runs a custom task to handle caching, skipping conditions, and parameter handling all in one place.
  • A publisher binary runs alongside user code to upload all task parameters to the ml-metadata service.
    • Bypasses the Tekton parameter size limit
  • Pipeline status and graph structure are all extracted into the ml-metadata service.
    • No longer need to render Tekton pipelines from their raw CRD YAML format

19 of 22

20 of 22

Kubeflow Pipelines on Tekton 2.0 Demo

  • Using the same SDK and UI images from the KFP 2.0 upstream
  • Only updated the KFP 2.0 backend microservices by adding the new Tekton module into the abstracted runtime interface.
  • Extended the default Tekton Pipelines deployment to include custom tasks for KFP runtime optimization and sub-DAG support.

21 of 22

Kubeflow Pipelines on Tekton 2.0 Future Optimization

Custom Task Controller: Future Implementation

Handle the DAG driver and DAG publisher in the DAG controller

  • Move the DAG driver and publisher logic into the DAG controller's reconcile logic. Provide configurations to reconcile using the native Tekton CRs and leverage Tekton's resource and security settings.
  • This also reduces graph complexity, as the pipeline no longer needs to create extra graph edges to connect all the root and leaf nodes.
  • Support a Status IR in the Kubeflow Pipelines community to handle all status within the metadata service scope.
  • Extend the looping CRD status to optionally offload to other storage for power users.

22 of 22

Links

Thank you