1 of 62

Flyte

Ketan Umare

Jan 24th, 2022

2 of 62

What is Flyte?

3 of 62

Kubernetes-native

Workflow Automation Platform

for Business-critical

Machine Learning and

Data Processes

at Scale

What is Flyte?

4 of 62

What is Flyte?

Define: Workflow

The sequence of industrial, administrative, or other processes through which a piece of work passes from initiation to completion.

Not quite a DAG!

Directed acyclic graphs imply no loops or repeats, but complex processes may have repetitions. At runtime, each execution still unrolls into a DAG.

5 of 62

What is Flyte?

Goal

  • Provide a reproducible, incremental, iterative and extensible workflow automation PaaS for any organization.
  • Focus on user experience, reliability and correctness.
  • Separation of responsibilities between platform teams and user teams.

Flyte Platform Services

[Diagram: Flyte Platform Services sit between user teams and the platform team, separating responsibilities. User teams (data scientists, ML engineers, data engineers, product SWEs) work in Python, Java, PyTorch, PySpark, TensorFlow, etc. - the "ML" side. The platform team (platform/infra engineers, DevOps, SRE) owns K8s, cloud, auto-scaling, cluster management, resource management, integrations, performance, monitoring, logs, spark-core, Golang, databases, etc. - the "Ops" side.]

6 of 62

Challenges of ML Orchestration

What Flyte solves.

7 of 62

Challenge 1

Develop Incrementally & Constantly Iterate @ Scale

Scale the Job

e.g. one region -> all regions, more GPUs

Start with one Job, run it locally

e.g. spark job, a training job, a query etc

Create a pipeline, test it locally

e.g. Fetch data -> train model -> calculate metrics

Execute the pipeline on demand, at scale

e.g. Run a pipeline with parameters

Run the pipeline on a schedule or in-response to an event

e.g. Run every hour

Retrieve results for jobs / pipelines

1

3

5

2

4

6

8 of 62

Challenge 2

Tame Infrastructure, Self-serve & Efficient

  • Centrally managed Infrastructure
  • Access resources — CPU/GPU/Mem, etc
  • Framework/Library independence
  • Multi-tenant, without users having to be aware of it
  • Automatic scaling
  • Knobs to control costs
  • Efficiency: Spot machines, model checkpointing
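A minimal sketch of how these knobs typically surface in flytekit code (parameter names reflect recent flytekit releases and are worth verifying against your version): resource requests/limits and spot-machine eligibility are declared directly on the task.

from flytekit import Resources, task

# Illustrative sketch: declare compute needs and allow spot/preemptible machines.
@task(
    requests=Resources(cpu="2", mem="4Gi", gpu="1"),
    limits=Resources(cpu="4", mem="8Gi"),
    interruptible=True,  # eligible for spot machines
    retries=2,           # resiliency if a spot node is reclaimed
)
def train(epochs: int) -> float:
    # training logic elided for brevity
    return 0.0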

9 of 62

Challenge 3

Parameterize Executions & Dynamism

  • In ML, experiments require:
    • Similar algorithms with different parameters / inputs
    • Altering behaviour based on inputs — usually scale
  • Time is nothing special; it is usually just another parameter used to test different hypotheses
  • Even with dynamism, knowing the pipeline ahead of time is desirable

[Diagram: the same pipeline run with different inputs, e.g., x=m1, x=m2, x=n.]
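One hedged way this shows up in flytekit: the same workflow is launched with different inputs, or wrapped in launch plans that default or pin certain parameters (the names and values below are illustrative).

from flytekit import LaunchPlan, task, workflow

@task
def experiment(x: int) -> float:
    return x * 0.5  # stand-in for the real algorithm

@workflow
def run_experiment(x: int = 1) -> float:
    return experiment(x=x)

# The same workflow, parameterized per hypothesis (x=m1, x=m2, x=n, ...).
small = LaunchPlan.get_or_create(
    workflow=run_experiment, name="experiment_small", default_inputs={"x": 10}
)
large = LaunchPlan.get_or_create(
    workflow=run_experiment, name="experiment_large", fixed_inputs={"x": 10000}
)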

10 of 62

Challenge 4

Memoize, Recover & Reproduce

  • Only re-execute changes
  • Recover from system failures (retries and complete recovery)
  • Timely executions
  • Data-sandboxes for tasks, isolated executions and capture of code snapshots enable reproducibility

[Diagram: repeated executions with the same input (x=m1) are memoized rather than re-executed.]

11 of 62

Challenge 5

Collaboration & Organizational Scaling

  • Domain experts can work on their parts independently
  • Other users in the organization can reuse
  • Users are free to use the language of their choice in a unified pipeline
  • Communication happens over a strong interface boundary

[Diagram: Team A owns PipelineA and dataA, including a critical/complex algorithm task; Team B owns PipelineB and dataB; PipelineC and a Composite Pipeline reuse these pieces, producing dataC.]

  • The composite pipeline is composed of Team A's pipeline, PipelineB, and other tasks.
  • PipelineC reuses the shared critical task.
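A hedged sketch of one way that interface boundary is consumed in flytekit: a team can reference another team's already-registered task by project/domain/name/version and compose it into its own workflow (all identifiers below are made up).

import pandas as pd
from flytekit import reference_task, workflow

# Team A's registered task, pulled in by identifier; only the typed
# interface is declared here, the implementation stays with Team A.
@reference_task(
    project="team_a",
    domain="production",
    name="pipelines.critical.algorithm",
    version="v1.2.0",
)
def critical_algorithm(data: pd.DataFrame) -> pd.DataFrame:
    ...

@workflow
def pipeline_b(data: pd.DataFrame) -> pd.DataFrame:
    return critical_algorithm(data=data)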

12 of 62

Challenge 6

Extend Simply

[Diagram: Flyte exposes one consistent API over interchangeable backends - Vendor A, Vendor B, or in-house.]

  1. Users want flexibility
    • Add simple Python extensions (like Airflow operators)
    • Maybe only for their teams
  2. The platform team wants to keep adding new capabilities
    • Distributed training support, Spark, streaming, etc.
    • Continue adding features and controlling their roll-out
  3. Organizations want flexibility
    • Control costs (migrate vendors, bring capabilities in-house)
    • Users' velocity and existing code should just work!

How Flyte addresses this:

  • Flytekit makes it easy to add new user customizations
  • Flyte also allows you to run just your own containers
  • Flyte backend plugins are independently deployed and maintained, and live in the hosted service
  • The Flyte control plane makes it possible to switch plugin associations, and OSS makes it possible to migrate

13 of 62

User Level Concepts

14 of 62

Building Blocks: Tasks

  • Smallest unit of work in Flyte
  • Declarative, versioned, language and framework independent
  • Maps to a backend execution plugin (extendable), e.g., containers, SQL queries, pods, WebAPI calls
  • Strong Interface (typed inputs and outputs)

[Diagram: a Task with typed inputs and outputs.]
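To make "language and framework independent" concrete, here is a hedged sketch of a raw-container task in flytekit: the typed interface is declared once, while the work runs in an arbitrary image (the image, script, and paths below are placeholders).

from flytekit import ContainerTask, kwtypes

# A task backed by any container image; Flyte passes typed inputs/outputs
# through files under the declared data directories.
normalize = ContainerTask(
    name="normalize_csv",
    image="ghcr.io/example/normalizer:latest",  # placeholder image
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(csv=str),
    outputs=kwtypes(normalized=str),
    command=["./normalize.sh", "/var/inputs/csv", "/var/outputs/normalized"],
)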

15 of 62

Workflows

  • Declarative & composable
  • Models data-flow through tasks
  • Strongly typed interfaces
  • Declarative, versioned and dynamically recursive
  • DSLs in Python, Java, Scala
  • The Python SDK supports an imperative model and bring-your-own DSL (see the sketch after the diagram below)

[Diagram: a Workflow with typed inputs and outputs, composed of Tasks whose typed outputs feed the inputs of downstream Tasks.]
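The imperative model mentioned above builds the graph programmatically instead of through a decorated function; a minimal sketch against flytekit's imperative API (method names follow recent flytekit releases and may differ in older ones):

from flytekit import Workflow, task

@task
def double(x: int) -> int:
    return 2 * x

# Build the same DAG imperatively rather than with @workflow.
wf = Workflow(name="imperative.double_wf")
wf.add_workflow_input("x", int)
node = wf.add_entity(double, x=wf.inputs["x"])
wf.add_workflow_output("result", node.outputs["o0"])  # "o0" is the default output name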

16 of 62

Write Your Code

  • Write code in Python with type annotations
  • Use DataFrames, PySpark, Python data types, etc.
  • Annotate with @task / @workflow
  • Optional: Enable caching
  • Execute locally

@task(cache=True, cache_version="1.0")
def pay_multiplier(df: pandas.DataFrame, scalar: int) -> pandas.DataFrame:
    df["col"] = scalar * df["col"]
    return df

@task
def total_spend(df: pyspark.DataFrame) -> int:
    return df.agg(F.sum("col")).collect()[0][0]

@workflow
def calculate_spend(emp_df: pandas.DataFrame) -> int:
    return total_spend(df=pay_multiplier(df=emp_df, scalar=2))

# Execute locally
pay_multiplier(df=pandas.DataFrame())
calculate_spend(emp_df=pandas.DataFrame())

17 of 62

Get Ready to Scale!

  • Specify resources for tasks
  • Specify Spark cluster configuration
  • Add resiliency — retries
  • Optional: Configure one or more Schedule / Notification
  • Of course, still execute it locally

@task(limits=Resources(cpu="2", mem="150Mi"))
def pay_multiplier(df: pandas.DataFrame, scalar: int) -> pandas.DataFrame:
    df["col"] = scalar * df["col"]
    return df

@task(
    task_config=Spark(spark_conf={"spark.driver.memory": "1000M"}),
    retries=2,
)
def total_spend(df: pyspark.DataFrame) -> int:
    return df.agg(F.sum("col")).collect()[0][0]

@workflow
def calculate_spend(emp_df: pandas.DataFrame) -> int:
    return total_spend(df=pay_multiplier(df=emp_df, scalar=2))

LaunchPlan.get_or_create(
    name="...",
    workflow=calculate_spend,
    schedule=FixedRate(duration=timedelta(minutes=10)),
    notifications=[
        Email(
            phases=[WorkflowExecutionPhase.FAILED],
            recipients_email=[...],
        )
    ],
)

18 of 62

Workflow Modalities

@workflow - evaluation is deferred until launch!

@dynamic - evaluation is deferred until the "return" statement, so the graph can depend on runtime inputs

@dynamic(cache=True, cache_version="0.1", limits=Resources(mem="600Mi"))
def parallel_fit_predict(
    multi_train: typing.List[pd.DataFrame],
    multi_val: typing.List[pd.DataFrame],
    multi_test: typing.List[pd.DataFrame],
) -> typing.List[typing.List[float]]:
    preds = []
    for loc, train, val, test in zip(LOCATIONS, multi_train, multi_val, multi_test):
        model = fit(loc=loc, train=train, val=val)
        preds.append(predict(test=test, model_ser=model))
    return preds

@workflow
def calculate_spend(emp_df: pandas.DataFrame) -> int:
    return total_spend(df=pay_multiplier(df=emp_df, scalar=2))

19 of 62

Execute and interact

# Simply run the script from command line

pyflyte run my.mod:wf [--remote]

# Fetch & Create Execution

flytectl get tasks -p proj -d dev core.advanced.run_merge_sort.merge --version v2 --execFile execution_spec.yaml

flytectl create execution --execFile execution_spec.yaml -p proj -d dev --targetProject proj

# Fetch results

flytectl get execution -p proj -d dev oeh94k9r2r --details -o yaml

Other interfaces: Jupyter Notebook (via FlyteRemote) and the Golang CLI (flytectl)
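The same interactions are available programmatically, e.g., from a Jupyter notebook, through flytekit's FlyteRemote client. A hedged sketch, with the project, domain, and workflow identifiers made up and the configuration API varying slightly across flytekit versions:

from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(
    config=Config.auto(),  # picks up your Flyte endpoint from config/env
    default_project="flytesnacks",
    default_domain="development",
)

# Fetch a registered workflow, launch it, wait for completion, read the outputs.
wf = remote.fetch_workflow(name="core.flyte_basics.hello_world.my_wf", version="v0.2.247")
execution = remote.execute(wf, inputs={}, wait=True)
print(execution.outputs)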

20 of 62

21 of 62

UX Overview - UI (rendered graphs)

22 of 62

UX Overview - UI (error traces)

23 of 62

UX Overview - Customized visualizations

  • FlyteDecks allow customized, sharable visualizations

@task
def t2() -> Annotated[pd.DataFrame, TopFrameRenderer(10)]:
    return iris_df

24 of 62

Concepts

  • Projects: logical grouping & tenant isolation
  • Launch plans: customize invocation behavior, multiple schedules, notifications
  • Execute one task independently: build & debug iteratively
  • Static DAG compilation: get errors before execution
  • Language and framework independence: code in Python, Java, Scala; execute arbitrary language code
  • Local execution: implement before you scale
  • Programmable and inspectable: retrieve & compare historical results; create your own centralized artifact repository
  • API-driven execution: on-demand, scheduled, and event-triggered execution

25 of 62

User Interaction Model

26 of 62

User Journey

Retrieve & Replay

  1. Retrieve results from executions
  2. Identify production errors
  3. Replay, reproduce historical artifacts
  4. Retrieve artifact lineage

Ideate & Iterate

  1. Write business logic
  2. Test task locally
  3. Test task remote
  4. Orchestrate multiple tasks into a Workflow
  5. Execute the workflow
  6. Repeat

Productionize

  1. Promote a pipeline to production (CI/CD)
  2. Create one or more schedules
  3. Execute ad-hoc
  4. Monitor and get notified

27 of 62

Write Code, Not YAML!

Leave the infrastructure to the platform

{
  "id": { "resourceType": 1, "project": "flytesnacks", "domain": "development", "name": "core.flyte_basics.hello_world.say_hello", "version": "v0.2.247" },
  "type": "python-task",
  "container": {
    "args": [ "pyflyte-execute", "--inputs", "{{.input}}", "--output-prefix", "{{.outputPrefix}}", "--raw-output-data-prefix", "{{.rawOutputDataPrefix}}", "--resolver", "flytekit.core.python_auto_container.default_task_resolver", "--", "task-module", "core.flyte_basics.hello_world", "task-name", "say_hello" ],
    "image": "ghcr.io/flyteorg/flytecookbook:core"
  }
}

@task
def square(n: int) -> int:
    return n * n

$ pyflyte --pkgs myapp.workflows package --image ghcr.io/flyteorg/flytecookbook:core

28 of 62

Why Registration?

  • Observable graphs (before execution)
  • Automated tracking
  • Full reproducibility — code snapshot captured, connected with a version (usually git-SHA)
  • Ability to have multiple active versions at the same time
  • Backtrack from predictions to actual code
  • Flyte's strongly typed system makes it possible to auto-generate a launch form, verify inputs, and ensure correct execution behavior

29 of 62

Before Serialize

After Serialize

@task
def say_hello() -> str:
    return "hello world"

@workflow
def my_wf() -> str:
    res = say_hello()
    return res

"id": { "resourceType": "WORKFLOW", "project": "flytesnacks", "domain": "development", "name": "core.flyte_basics.hello_world.my_wf", "version": "v0.2.247" },

"tasks": [{"template": { TaskSpec 1 }}]

"closure": { "compiledWorkflow": { "primary": { "template": {

}

}

….

$ pyflyte --pkgs myapp.workflows package --image ghcr.io/flyteorg/flytecookbook:core

$ docker build -t ghcr.io/flyteorg/flytecookbook:core .

$ docker push ghcr.io/flyteorg/flytecookbook:core

30 of 62

Serialize Package

GRPC / REST

"id": { "resourceType": "WORKFLOW", "project": "flytesnacks", "domain": "development", "name": "core.flyte_basics.hello_world.my_wf", "version": "v0.2.247" },

"tasks": [{"template": { TaskSpec 1 }}]

"closure": { "compiledWorkflow": { "primary": { "template": {

}

}

$ flytectl register files -p flytesnacks -d development --version v1.0.0 flyte-packages.tar.gz --archive

31 of 62

Process of Registration

32 of 62

Before Serialize

After Fast Serialize

@task
def say_hello() -> str:
    return "hello flyte"

@workflow
def my_wf() -> str:
    res = say_hello()
    return res

"id": { "resourceType": "WORKFLOW", "project": "flytesnacks", "domain": "development", "name": "core.flyte_basics.hello_world.my_wf", "version": "v0.2.247" },

"tasks": [{"template": { TaskSpec 1 }}]

"closure": { "compiledWorkflow": { "primary": { "template": {

}

}

….

$ pyflyte --pkgs myapp.workflows package --fast --image ghcr.io/flyteorg/flytecookbook:core

Code artifact

33 of 62

Fast Serialize Package

GRPC

Upload

"id": { "resourceType": "WORKFLOW", "project": "flytesnacks", "domain": "development", "name": "core.flyte_basics.hello_world.my_wf", "version": "v0.2.247" },

"tasks": [{"template": { TaskSpec… }}]

"closure": { "compiledWorkflow": { "primary": { "template": {

}

}

$ flytectl register files -p flytesnacks -d development --version v1.0.0 flyte-packages.tar.gz --archive

Blob store

Code artifact

34 of 62

Fast Registration

Code artifact

Re-use container

Blob store

35 of 62

Swiss Army Knife: Sandbox Environment

$ flytectl demo start --version v0.19.1

36 of 62

MLOps meets DevOps

37 of 62

Domains + Registration: DevOps Power!

Step 1 of 3

  • At Lyft, every time a new PR is created:
    • A Docker container is built
    • The code is registered with Flyte
  • If users modify their dependencies, the PR is first required to build containers (securely)
  • For code-only changes, users can use fast registration to snapshot the code from laptops or dev environments
  • Everything is automatically tracked!
  • Multiple users can create isolated PRs and test independently - at scale!
  • This flow is now followed at GoJek, Wolt, Spotify, Toyota, etc.

38 of 62

Domains + Registration: DevOps Power!

Step 2 of 3

  • Once the code is ready and passes code review, merge to master
  • Users then drive a deployment through a deployment pipeline (see the sketch below)
  • At each stage of the deployment, workflows and tasks are registered with a specific domain in Flyte
  • Each domain may change the data directory, the associated roles, or meta attributes like labels and annotations
  • At Lyft, only the production domain allows schedules
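A hedged sketch of what one deployment stage might run, reusing the packaging and registration commands shown earlier with only the domain and version switched per stage (the project name, file names, and $GIT_SHA variable are placeholders):

# Build and package once per merge, then register the same package into each domain in turn.
$ pyflyte --pkgs myapp.workflows package --image ghcr.io/example/myapp:$GIT_SHA
$ flytectl register files -p myproject -d staging --version $GIT_SHA flyte-packages.tar.gz --archive
# ...after staging validation passes...
$ flytectl register files -p myproject -d production --version $GIT_SHA flyte-packages.tar.gz --archive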

39 of 62

Domains + Registration: DevOps Power!

Step 3 of 3

  • For production deployments, logs and metrics are automatically tracked.
  • Models are automatically promoted to serving infrastructure (multiple options depending on the team)
  • Users can use interactive notebooks to retrieve intermediate or final outputs, analyze data and also automate various monitoring tasks

40 of 62

Extensibility & Flexibility

41 of 62

Why?

Extensibility & Flexibility

  • It is impossible to build an orchestration platform that fits 100% of use cases
  • As we built Flyte and saw it used by many different organizations, we realized that users want to:
    • Extend
    • Decorate
    • Contribute
  • This led to Flyte's highly extensible design:
    • Flyte is extensible in every component - the programming SDK, the backend, the UI, etc.
  • Flyte also brings the convenience of a service-oriented architecture to data orchestration - no more dealing with coupled client code and infrastructure.

42 of 62

Use Cases for Extending Flyte

Extensibility & Flexibility

Use case -> recommended extension point:

  • Python? Custom extensions, try before investing in a backend plugin -> Python flytekit plugin
  • A library with user-defined extensions -> User container plugin
  • Provide a prebuilt container, plug & play -> Prebuilt container plugin
  • Custom, specialized, domain-specific types -> Flytekit Type Transformer
  • Golang? Multi-language support? A high-performance plugin -> Backend plugin (K8s plugin, WebAPI plugin, "fancy" plugin)
  • Write new experiences for your users -> Meta DSL on flytekit
  • A new language SDK -> Flyte Service API
  • Customized interfaces, no-code solutions, etc. -> Contributions :D

43 of 62

Extensibility & Flexibility

FlyteKit Type Transformers

Allows you to create domain specific types and let Flyte understand them. They can be configured to be auto-loaded.

FlyteKit-Only Task Plugins

Allows syntactic sugar to be provided like a library. Flyte executes a python container, so you can potentially do anything in Flytekit itself.

FlyteKit Data Persistence Plugins

Allows you to persist data to various stores by automatically using URIs, e.g., s3://, gcs://, bq://...

$ pip install flytekitplugins-*

$ pip install flytekitplugins-data-*
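A minimal, hedged sketch of the Type Transformer extension point described above: the ZipCode type is hypothetical, and the exact module paths below can shift between flytekit versions.

import typing

from flytekit.core.context_manager import FlyteContext
from flytekit.extend import TypeEngine, TypeTransformer
from flytekit.models.literals import Literal, Primitive, Scalar
from flytekit.models.types import LiteralType, SimpleType


class ZipCode:
    """A hypothetical domain-specific type."""
    def __init__(self, code: str):
        self.code = code


class ZipCodeTransformer(TypeTransformer[ZipCode]):
    """Teaches Flyte to move ZipCode values across task boundaries as strings."""

    def __init__(self):
        super().__init__(name="zipcode-transformer", t=ZipCode)

    def get_literal_type(self, t: typing.Type[ZipCode]) -> LiteralType:
        return LiteralType(simple=SimpleType.STRING)

    def to_literal(self, ctx: FlyteContext, python_val: ZipCode,
                   python_type: typing.Type[ZipCode], expected: LiteralType) -> Literal:
        return Literal(scalar=Scalar(primitive=Primitive(string_value=python_val.code)))

    def to_python_value(self, ctx: FlyteContext, lv: Literal,
                        expected_python_type: typing.Type[ZipCode]) -> ZipCode:
        return ZipCode(lv.scalar.primitive.string_value)


# Register once (e.g., at import time); tasks can then use ZipCode in their signatures.
TypeEngine.register(ZipCodeTransformer())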

44 of 62

Extensibility & Flexibility

DSLs in Other Languages

Use the core protobuf definitions to write an SDK in any new language!

Java and Scala SDKs are already available (incubating), contributed by Spotify.

case class GreetTaskInput(name: String)
case class GreetTaskOutput(greeting: String)

class GreetTask
    extends SdkRunnableTask(
      SdkScalaType[GreetTaskInput],
      SdkScalaType[GreetTaskOutput]
    ) {
  override def run(input: GreetTaskInput): GreetTaskOutput =
    GreetTaskOutput(s"Welcome, ${input.name}!")
}

object GreetTask {
  def apply(name: SdkBindingData): SdkTransform =
    new GreetTask().withInput("name", name)
}

45 of 62

Extensibility & Flexibility

Backend Plugins

True power of Flyte!

Powerful, stateful plugins: starting multiple containers for a cluster, calling external APIs, performing complex auth flows, etc.

Unified API across languages

Maintain easily: patch-fix without deploying code fixes to users

Migrate seamlessly

@task(
    task_config=MPIJob(
        num_workers=2,
        num_launcher_replicas=1,
        slots=1,
    ),
    retries=3,
    cache=True,
    cache_version="0.1",
    requests=Resources(cpu="1", mem="300Mi"),
    limits=Resources(cpu="2"),
)
def horovod_train_task(batch_size: int, buffer_size: int, dataset_size: int) -> FlyteDirectory:
    hvd.init()
    ...

46 of 62

But wait, what about platform folks?

47 of 62

Architecture overview

48 of 62

Flyte Component Layer Cake

49 of 62

Built by Platform Engineers for Platform Teams!

Serverless

Provide a serverless environment for your users (central service)

Platform Builders

Extend & customize everything.

Incremental and recoverable

Best in class support for memoization and complete recovery from any transient failures

Low footprint

Backend written in performant Golang

OAuth2 & SSO

OAuth2 and SSO support is available natively in open source

Observe and Audit

Published monitoring dashboard templates, extensive documentation

Isolated & Secure Execution

Execute with separate Permissions, administer and manage quotas, queues etc

gRPC

Fully documented Service specification

50 of 62

Features for the Platform Folks

  • Platform failures?
    • Recover instantly: try the Recover button
  • Manage clusters transparently
    • Add/remove/drain clusters; isolate teams to specific k8s clusters
  • Full SSO out of the box
  • Add platform-wide defaults
  • Set per-project/domain resource limits
  • Use spot machines: the platform automatically guarantees execution, with support for intra-task checkpointing (sketched below)
  • Continuous deployment: no single point of failure or isolation
  • Set compute limits
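For the spot-machine bullet above, a hedged sketch of intra-task checkpointing as exposed through flytekit's task context (the read/write byte interface mirrors the public checkpointing examples; exact behavior may vary by version):

from flytekit import current_context, task

@task(retries=3, interruptible=True)
def train_with_checkpoints(n_epochs: int) -> int:
    cp = current_context().checkpoint
    start = 0
    prev = cp.read()  # bytes saved by a previous (preempted) attempt, if any
    if prev is not None:
        start = int(prev.decode())
    for epoch in range(start, n_epochs):
        # ... one epoch of work ...
        cp.write(str(epoch + 1).encode())  # persist progress for the next attempt
    return n_epochs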

51 of 62

Sneak Peek…

52 of 62

UnionML

The easiest way to build and deploy machine learning microservices

UnionML bridges the gap between research and production by providing a unified interface for datasets, model training, and batch, online, and streaming prediction.

53 of 62

Dataset & Model

54 of 62

Dataset & Model

55 of 62

Serve Seamlessly

56 of 62

57 of 62

Flyte Community

58 of 62

Flyte Community

Flyte Overview

  • Communication
    • Flyte Slack (750+)

https://slack.flyte.org

    • Flyte GitHub Discussions

https://github.com/flyteorg/flyte/discussions

  • Announcements
    • Slack
    • Twitter

https://twitter.com/flyteorg

    • Google Group

https://groups.google.com/u/0/a/flyte.org/g/users

  • Publications
    • Flyte Blog

https://blog.flyte.org/

    • Flyte Monthly

https://www.getrevue.co/profile/flyte

  • Sync Up
    • Biweekly OSS Sync Up

https://www.addevent.com/calendar/kE355955

    • YouTube

https://bit.ly/3jL6xOX

59 of 62

Collaborators, Contributors & Integrations

Flyte Overview

  • Lyft
  • Spotify
  • Striveworks
  • USU
  • Freenome
  • WovenPlanet
  • Intel
  • Gojek
  • Union.ai
  • Blackshark
  • Wolt (Doordash)
  • RunX
  • MethaneSAT
  • And many others

60 of 62

The Open Source Community

Flyte team understands well that community is a key factor in building a successful open source project. The team gave me and many others first class onboarding experience which created trust to build a closer collaboration with them.

-- Pradithya Aria Pura, Tech Lead - ML Platform, GoJek

It is a privilege to be part of an open-source community as vibrant and committed as the one Ketan and team have built around Flyte. The diversity of problems the community is tackling, along with the commitment and engagement from the core team is something special, and the team here at Striveworks is excited to be a part of it.

-- Jim Rebesco, co-founder, Striveworks

Flyte is quickly becoming a workhorse at Freenome. The regular cadence of high quality backward-compatible releases, quick turnarounds on bug fixes, realtime support on Slack, and a vibrant community of helpful power users across many industries have really supercharged our adoption.

-- Jeev Balakrishnan, Tech Lead ML Platform, Freenome

Contributing to flyte.org projects was an incredibly streamlined, supportive, and collaborative experience. Many open-source projects are difficult to work with due to lack of communication, long delays, and inflexibility about making larger changes and supporting new use cases. Haytham and Ketan were both extremely helpful as far as helping me workshop my original idea, giving quick feedback and adding me to the Github organization so I could run CI builds. They really encourage new contributors and new ideas while maintaining a product vision and high code quality.

-- Claire Mcginty, Engineer, Spotify

61 of 62

Recap

Roadmap

Reproducibility

  • Capture code snapshots - containers
  • Workflows as a means of auto-checkpointing
  • Flyte constructs data sandboxes for executions
  • Memoize executions and recover
  • Versioned workflows and tasks

Observability

  • Workflow graphs before they execute
  • Execution timelines & logs - debug
  • Intermediate inputs & outputs
  • Documentation for code
  • Failures shown inline
  • Historical executions - tied to versions
  • FlyteDecks

62 of 62

Few more items

Roadmap

  • Work in progress
    • Actor support to allow reusing existing clusters like Ray, Dask, Spark, etc.
    • This will enable better hyperparameter search algorithms, especially with iterative pruning
  • What will help you
    • Do not think about clusters; think about the problem
    • Multi-tenant environment
    • Support for caching at task boundaries
    • Intra-task checkpointing and use of spot machines to allow resilient workflows