1 of 22

Tommy Li, Senior Software Developer, IBM

The Complexity of Scaling ML Pipelines on Kubernetes Using Tekton

2 of 22

Tekton Pipeline

3 of 22

OpenShift Pipelines

  • OpenShift Container Platform is the industry’s leading enterprise Kubernetes platform, and it brings many features for developers out of the box, including CI/CD capabilities.
  • OpenShift Pipelines is based on the Tekton project and offers native integration with the OpenShift platform to provide a smooth experience for developers.
  • Certified by Red Hat for OpenShift
  • Enterprise version available on OpenShift

4 of 22

What Tekton provides out of the box

  • Run workflows
  • Create a new pod for each task
  • String-matching conditions (when expressions)
  • Parameter passing
  • API that connects to custom controllers
  • Ability to optimize workflows at the controller level and provide abstracted templates
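As a sketch, the items above map onto a Tekton Pipeline spec roughly as follows; the pipeline and task names here are hypothetical:

```yaml
# Hypothetical pipeline illustrating parameter passing and a
# string-matching `when` condition; each task runs in its own pod.
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: train-and-deploy
spec:
  params:
    - name: model-name
      type: string
  tasks:
    - name: train                      # scheduled as its own pod
      taskRef:
        name: train-task               # assumed to exist in the cluster
      params:
        - name: model-name
          value: $(params.model-name)  # parameter passing
    - name: deploy
      runAfter: ["train"]
      when:                            # string-matching condition
        - input: $(tasks.train.results.status)
          operator: in
          values: ["passed"]
      taskRef:
        name: deploy-task
```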

5 of 22

Useful Tekton Features that Empower ML Pipelines

  • Tekton Finally
  • Tekton API standard and Spec abstraction
  • Tekton workspaces
  • Tekton custom tasks
  • Tekton termination logic
  • Tekton matrix and looping
  • CEL condition expression
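A few of these features combined in one hedged sketch (task and workspace names are invented): `workspaces` share storage between tasks, a `matrix` fans a task out over parameter values, and `finally` runs cleanup after all other tasks complete. CEL expressions and custom tasks run through separate controllers and are omitted here.

```yaml
# Hypothetical pipeline fragment combining workspaces, matrix
# fan-out, and a `finally` cleanup task (matrix is gated behind
# alpha/beta feature flags in some Tekton releases).
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: feature-demo
spec:
  workspaces:
    - name: shared-data                # shared storage across tasks
  tasks:
    - name: train
      matrix:                          # one TaskRun per value
        params:
          - name: learning-rate
            value: ["0.1", "0.01", "0.001"]
      taskRef:
        name: train-task               # hypothetical task
      workspaces:
        - name: data
          workspace: shared-data
  finally:                             # always runs, even on failure
    - name: cleanup
      taskRef:
        name: cleanup-task
```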

6 of 22

Tekton High level flow

7 of 22

Tekton Limitations for ML Workflows

  • No Caching out of the box
  • Limited capability from the Python SDK
  • No garbage collection policy out of the box
  • No log archival out of the box
  • No out of the box parameter sharing capability across pipelines
  • Limited scalability because the whole pipeline status is stored in etcd
  • Difficult to track metadata lineage

8 of 22

Kubeflow Pipelines

9 of 22

Initial Optimization leveraging Kubeflow Pipelines

10 of 22

What Kubeflow Pipelines on Tekton V1 provides

  • Tekton pipeline garbage collection (reduces Kubernetes etcd disk usage)
  • Python DSL and API server to optimize queries to the Kubernetes API
  • Caching using pod mutations (no impact on the Tekton controller logic)
  • Advanced conditions using the custom task controller
  • Pipeline loop/sub-pipeline logic using the custom task controller
  • Log archival and common storage setup
  • Easy-to-plug-in helper functions for custom usage, such as waiting for files
  • Preliminary metadata tracking
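The loop logic above is delivered as a Tekton custom task. Below is a hedged sketch assuming KFP-Tekton's `PipelineLoop` CRD; exact field names may differ between releases, and the inner task names are hypothetical:

```yaml
# Hedged sketch of KFP-Tekton's PipelineLoop custom task: the
# controller expands one iteration of the inner pipeline per value
# of the iteration parameter.
apiVersion: custom.tekton.dev/v1alpha1
kind: PipelineLoop
metadata:
  name: sweep-learning-rate
spec:
  iterateParam: learning-rate          # parameter looped over
  pipelineSpec:                        # inner pipeline, run per item
    params:
      - name: learning-rate
        type: string
    tasks:
      - name: train
        taskRef:
          name: train-task             # hypothetical task
        params:
          - name: learning-rate
            value: $(params.learning-rate)
```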

11 of 22

Limitation with KFP-Tekton V1

12 of 22

Moving to Kubeflow Pipelines V2

13 of 22

What is Smart Runtime

14 of 22

Abstraction Layer Features

15 of 22

Kubeflow Pipelines V2 Design Charts: Drivers

16 of 22

Kubeflow Pipelines V2 Design Charts: Publishers

17 of 22

Kubeflow Pipelines on Tekton 2.0 High Level Flow

Custom Task Controller: New Implementation

Handle Root DAG, sub-DAG, and CONTAINER drivers as well as DAG Publisher

  • Reuse the driver code under v2/driver to implement the custom task controller and connect it with the Tekton taskrun reconcile logic using the Tekton package directly. This removes the extra task CR from each KFP Task.

  • Condition and parameter pulling/storing are done as part of the driver and publisher code. We can add CEL package support for special expressions without the need for another when expression.
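Concretely, with this implementation a KFP step in the Tekton pipeline can be expressed as a custom task reference that the controller reconciles directly, instead of materializing an extra CR per task. The API group, kind, and result names below are illustrative placeholders, not the project's actual names:

```yaml
# Illustrative only: a pipeline task delegated to a KFP custom task
# controller, which runs the driver/publisher logic inside its
# reconcile loop. apiVersion/kind are placeholders, not confirmed.
    - name: train
      taskRef:
        apiVersion: kfp.tekton.dev/v1alpha1    # placeholder group
        kind: KFPTask                          # placeholder kind
        name: train-driver
      params:
        - name: parent-dag-id
          value: $(tasks.root-dag.results.execution-id)  # hypothetical
```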

18 of 22

Kubeflow Pipelines on Tekton V2 brings

  • Runs a custom task to handle caching, skipping conditions, and parameter handling all in one place.
  • A publisher binary runs alongside user code to upload all task parameters to the ml-metadata service.
    • Bypasses the Tekton parameter size limit
  • Pipeline status and graph structure are all extracted into the ml-metadata service.
    • No longer need to render Tekton pipelines from their raw CRD YAML format

19 of 22

20 of 22

Kubeflow Pipelines on Tekton 2.0 Demo

  • Using the same SDK and UI images from the KFP 2.0 upstream
  • Only updated the KFP 2.0 backend microservices by adding the new Tekton module into the abstracted runtime interface.
  • Extended the default Tekton Pipelines deployment to include custom tasks for KFP runtime optimization and sub-DAG support.

21 of 22

Kubeflow Pipelines on Tekton 2.0 Future Optimization

Custom Task Controller: Future Implementation

Handle the DAG driver and DAG publisher in the DAG controller

  • Move the DAG driver and publisher logic into the DAG controller's reconcile logic. Provide configurations to reconcile using the native Tekton CRs and leverage Tekton's resource and security settings.
  • This also reduces graph complexity, as the pipeline no longer needs to create extra graph edges to connect all the root and leaf nodes.
  • Support a Status IR in the Kubeflow Pipelines community to handle all status within the metadata service scope.
  • Extend the looping CRD status to optionally offload to other storage for power users.

22 of 22

Links

Thank you