1 of 62

Flyte

Ketan Umare

Jan 24th, 2022

2 of 62

What is Flyte?

3 of 62

Kubernetes-native

Workflow Automation Platform

for Business-critical

Machine Learning and

Data Processes

at Scale

What is Flyte?

4 of 62

What is Flyte?

Define: Workflow

The sequence of industrial, administrative, or other processes through which a piece of work passes from initiation to completion.

Not quite a DAG!

Directed acyclic graphs imply no loops or repeats, but complex processes may have repetitions. At runtime, each execution still unrolls into a DAG.

5 of 62

What is Flyte?

Goal

  • Provide a reproducible, incremental, iterative and extensible workflow automation PaaS for any organization.
  • Focus on user experience, reliability and correctness.
  • Separation of responsibilities between platform teams and user teams.

Flyte Platform Services

[Diagram: Flyte Platform Services sit between user teams and the platform team, separating responsibilities. User teams (data scientists, ML engineers, data engineers, product SWEs) work in Python, Java, PyTorch, PySpark, TensorFlow, etc. - the "ML" side. The platform team (platform/infra engineers, DevOps, SRE) owns K8s, cloud, auto-scaling, cluster management, resource management, integrations, performance, monitoring, logs, spark-core, Golang, databases, etc. - the "Ops" side.]

6 of 62

Challenges of ML Orchestration

What Flyte solves.

7 of 62

Challenge 1

Develop Incrementally & Constantly Iterate @ Scale

Scale the Job

e.g. one region -> all regions, more GPUs

Start with one Job, run it locally

e.g. spark job, a training job, a query etc

Create a pipeline, test it locally

e.g. Fetch data -> train model -> calculate metrics

Execute the pipeline on demand, at scale

e.g. Run a pipeline with parameters

Run the pipeline on a schedule or in-response to an event

e.g. Run every hour

Retrieve results for jobs / pipelines

1

3

5

2

4

6

8 of 62

Challenge 2

Tame Infrastructure, Self-serve & Efficient

  • Centrally managed Infrastructure
  • Access resources — CPU/GPU/Mem, etc
  • Framework/Library independence
  • Multi-tenant, without users having to be aware of it
  • Automatic scaling
  • Knobs to control costs
  • Efficiency: Spot machines, model checkpointing
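A minimal sketch of how these knobs typically surface in flytekit code (parameter names reflect recent flytekit releases and are worth verifying against your version): resource requests/limits and spot-machine eligibility are declared directly on the task.

from flytekit import Resources, task

# Illustrative sketch: declare compute needs and allow spot/preemptible machines.
@task(
    requests=Resources(cpu="2", mem="4Gi", gpu="1"),
    limits=Resources(cpu="4", mem="8Gi"),
    interruptible=True,  # eligible for spot machines
    retries=2,           # resiliency if a spot node is reclaimed
)
def train(epochs: int) -> float:
    # training logic elided for brevity
    return 0.0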

9 of 62

Challenge 3

Parameterize Executions & Dynamism

  • In ML, experiments require:
    • Similar algorithms with different parameters / inputs
    • Altering behaviour based on inputs — usually scale
  • Time is nothing special; it is usually just another parameter used to test different hypotheses
  • Even with dynamism, knowing the pipeline ahead of time is desirable

[Diagram: the same pipeline run with different inputs, e.g., x=m1, x=m2, x=n.]
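One hedged way this shows up in flytekit: the same workflow is launched with different inputs, or wrapped in launch plans that default or pin certain parameters (the names and values below are illustrative).

from flytekit import LaunchPlan, task, workflow

@task
def experiment(x: int) -> float:
    return x * 0.5  # stand-in for the real algorithm

@workflow
def run_experiment(x: int = 1) -> float:
    return experiment(x=x)

# The same workflow, parameterized per hypothesis (x=m1, x=m2, x=n, ...).
small = LaunchPlan.get_or_create(
    workflow=run_experiment, name="experiment_small", default_inputs={"x": 10}
)
large = LaunchPlan.get_or_create(
    workflow=run_experiment, name="experiment_large", fixed_inputs={"x": 10000}
)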

10 of 62

Challenge 4

Memoize, Recover & Reproduce

  • Only re-execute changes
  • Recover from system failures (retries and complete recovery)
  • Timely executions
  • Data-sandboxes for tasks, isolated executions and capture of code snapshots enable reproducibility

[Diagram: repeated executions with the same input (x=m1) are memoized rather than re-executed.]

11 of 62

Challenge 5

Collaboration & Organizational Scaling

  • Domain experts can work on their parts independently
  • Other users in the organization can reuse
  • Users are free to use the language of their choice in a unified pipeline
  • Communication happens over a strong interface boundary

[Diagram: Team A owns PipelineA and dataA, including a critical/complex algorithm task; Team B owns PipelineB and dataB; PipelineC and a Composite Pipeline reuse these pieces, producing dataC.]

  • The composite pipeline is composed of Team A's pipeline, PipelineB, and other tasks.
  • PipelineC reuses the shared critical task.
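A hedged sketch of one way that interface boundary is consumed in flytekit: a team can reference another team's already-registered task by project/domain/name/version and compose it into its own workflow (all identifiers below are made up).

import pandas as pd
from flytekit import reference_task, workflow

# Team A's registered task, pulled in by identifier; only the typed
# interface is declared here, the implementation stays with Team A.
@reference_task(
    project="team_a",
    domain="production",
    name="pipelines.critical.algorithm",
    version="v1.2.0",
)
def critical_algorithm(data: pd.DataFrame) -> pd.DataFrame:
    ...

@workflow
def pipeline_b(data: pd.DataFrame) -> pd.DataFrame:
    return critical_algorithm(data=data)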

12 of 62

Challenge 6

Extend Simply

[Diagram: Flyte exposes one consistent API over interchangeable backends - Vendor A, Vendor B, or in-house.]

  1. Users want flexibility
    • Add simple Python extensions (like Airflow operators)
    • Maybe only for their teams
  2. The platform team wants to keep adding new capabilities
    • Distributed training support, Spark, streaming, etc.
    • Continue adding features and controlling their roll-out
  3. Organizations want flexibility
    • Control costs (migrate vendors, bring capabilities in-house)
    • Users' velocity and existing code should just work!

How Flyte addresses this:

  • Flytekit makes it easy to add new user customizations
  • Flyte also allows you to run just your own containers
  • Flyte backend plugins are independently deployed and maintained, and live in the hosted service
  • The Flyte control plane makes it possible to switch plugin associations, and OSS makes it possible to migrate

13 of 62

User Level Concepts

14 of 62

Building Blocks: Tasks

  • Smallest unit of work in Flyte
  • Declarative, versioned, language and framework independent
  • Maps to a backend execution plugin (extendable), e.g., containers, SQL queries, pods, WebAPI calls
  • Strong Interface (typed inputs and outputs)

[Diagram: a Task with typed inputs and outputs.]
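To make "language and framework independent" concrete, here is a hedged sketch of a raw-container task in flytekit: the typed interface is declared once, while the work runs in an arbitrary image (the image, script, and paths below are placeholders).

from flytekit import ContainerTask, kwtypes

# A task backed by any container image; Flyte passes typed inputs/outputs
# through files under the declared data directories.
normalize = ContainerTask(
    name="normalize_csv",
    image="ghcr.io/example/normalizer:latest",  # placeholder image
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(csv=str),
    outputs=kwtypes(normalized=str),
    command=["./normalize.sh", "/var/inputs/csv", "/var/outputs/normalized"],
)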

15 of 62

Workflows

  • Declarative & composable
  • Models data-flow through tasks
  • Strongly typed interfaces
  • Declarative, versioned and dynamically recursive
  • DSLs in Python, Java, Scala
  • The Python SDK supports an imperative model and bring-your-own DSL (see the sketch after the diagram below)

[Diagram: a Workflow with typed inputs and outputs, composed of Tasks whose typed outputs feed the inputs of downstream Tasks.]
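The imperative model mentioned above builds the graph programmatically instead of through a decorated function; a minimal sketch against flytekit's imperative API (method names follow recent flytekit releases and may differ in older ones):

from flytekit import Workflow, task

@task
def double(x: int) -> int:
    return 2 * x

# Build the same DAG imperatively rather than with @workflow.
wf = Workflow(name="imperative.double_wf")
wf.add_workflow_input("x", int)
node = wf.add_entity(double, x=wf.inputs["x"])
wf.add_workflow_output("result", node.outputs["o0"])  # "o0" is the default output name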

16 of 62

Write Your Code

  • Write code in Python with type annotations
  • Use DataFrames, PySpark, Python data types, etc.
  • Annotate with @task / @workflow
  • Optional: Enable caching
  • Execute locally

@task(cache=True, cache_version="1.0")
def pay_multiplier(df: pandas.DataFrame, scalar: int) -> pandas.DataFrame:
    df["col"] = scalar * df["col"]
    return df

@task
def total_spend(df: pyspark.DataFrame) -> int:
    return df.agg(F.sum("col")).collect()[0][0]

@workflow
def calculate_spend(emp_df: pandas.DataFrame) -> int:
    return total_spend(df=pay_multiplier(df=emp_df, scalar=2))

# Execute locally
pay_multiplier(df=pandas.DataFrame())
calculate_spend(emp_df=pandas.DataFrame())

17 of 62

Get Ready to Scale!

  • Specify resources for tasks
  • Specify Spark cluster configuration
  • Add resiliency — retries
  • Optional: Configure one or more Schedule / Notification
  • Of course, still execute it locally

@task(limits=Resources(cpu="2", mem="150Mi"))
def pay_multiplier(df: pandas.DataFrame, scalar: int) -> pandas.DataFrame:
    df["col"] = scalar * df["col"]
    return df

@task(
    task_config=Spark(spark_conf={"spark.driver.memory": "1000M"}),
    retries=2,
)
def total_spend(df: pyspark.DataFrame) -> int:
    return df.agg(F.sum("col")).collect()[0][0]

@workflow
def calculate_spend(emp_df: pandas.DataFrame) -> int:
    return total_spend(df=pay_multiplier(df=emp_df, scalar=2))

LaunchPlan.get_or_create(
    name="...",
    workflow=calculate_spend,
    schedule=FixedRate(duration=timedelta(minutes=10)),
    notifications=[
        Email(
            phases=[WorkflowExecutionPhase.FAILED],
            recipients_email=[...],
        )
    ],
)

18 of 62

Workflow Modalities

@workflow - evaluation is deferred until launch!

@dynamic - evaluation is deferred until the "return" statement, so the graph can depend on runtime inputs

@dynamic(cache=True, cache_version="0.1", limits=Resources(mem="600Mi"))
def parallel_fit_predict(
    multi_train: typing.List[pd.DataFrame],
    multi_val: typing.List[pd.DataFrame],
    multi_test: typing.List[pd.DataFrame],
) -> typing.List[typing.List[float]]:
    preds = []
    for loc, train, val, test in zip(LOCATIONS, multi_train, multi_val, multi_test):
        model = fit(loc=loc, train=train, val=val)
        preds.append(predict(test=test, model_ser=model))
    return preds

@workflow
def calculate_spend(emp_df: pandas.DataFrame) -> int:
    return total_spend(df=pay_multiplier(df=emp_df, scalar=2))

19 of 62

Execute and interact

# Simply run the script from command line

pyflyte run my.mod:wf [--remote]

# Fetch & Create Execution

flytectl get tasks -p proj -d dev core.advanced.run_merge_sort.merge --version v2 --execFile execution_spec.yaml

flytectl create execution --execFile execution_spec.yaml -p proj -d dev --targetProject proj

# Fetch results

flytectl get execution -p proj -d dev oeh94k9r2r --details -o yaml

Other interfaces: Jupyter Notebook (via FlyteRemote) and the Golang CLI (flytectl)
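The same interactions are available programmatically, e.g., from a Jupyter notebook, through flytekit's FlyteRemote client. A hedged sketch, with the project, domain, and workflow identifiers made up and the configuration API varying slightly across flytekit versions:

from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(
    config=Config.auto(),  # picks up your Flyte endpoint from config/env
    default_project="flytesnacks",
    default_domain="development",
)

# Fetch a registered workflow, launch it, wait for completion, read the outputs.
wf = remote.fetch_workflow(name="core.flyte_basics.hello_world.my_wf", version="v0.2.247")
execution = remote.execute(wf, inputs={}, wait=True)
print(execution.outputs)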

20 of 62

21 of 62

UX Overview - UI (rendered graphs)

22 of 62

UX Overview - UI (error traces)

23 of 62

UX Overview - Customized visualizations

  • FlyteDecks allow customized, sharable visualizations

@task
def t2() -> Annotated[pd.DataFrame, TopFrameRenderer(10)]:
    return iris_df

24 of 62

Concepts

  • Projects: logical grouping & tenant isolation
  • Launch plans: customize invocation behavior, multiple schedules, notifications
  • Execute one task independently: build & debug iteratively
  • Static DAG compilation: get errors before execution
  • Language and framework independence: code in Python, Java, Scala; execute arbitrary language code
  • Local execution: implement before you scale
  • Programmable and inspectable: retrieve & compare historical results; create your own centralized artifact repository
  • API-driven execution: on-demand, scheduled, and event-triggered execution

25 of 62

User Interaction Model

26 of 62

User Journey

Retrieve & Replay

  1. Retrieve results from executions
  2. Identify production errors
  3. Replay, reproduce historical artifacts
  4. Retrieve artifact lineage

Ideate & Iterate

  1. Write business logic
  2. Test task locally
  3. Test task remote
  4. Orchestrate multiple tasks into a Workflow
  5. Execute the workflow
  6. Repeat

Productionize

  1. Promote a pipeline to production (CI/CD)
  2. Create one or more schedules
  3. Execute ad-hoc
  4. Monitor and get notified

27 of 62

Write Code, Not YAML!

Leave the infrastructure to the platform

{
  "id": { "resourceType": 1, "project": "flytesnacks", "domain": "development", "name": "core.flyte_basics.hello_world.say_hello", "version": "v0.2.247" },
  "type": "python-task",
  "container": {
    "args": [ "pyflyte-execute", "--inputs", "{{.input}}", "--output-prefix", "{{.outputPrefix}}", "--raw-output-data-prefix", "{{.rawOutputDataPrefix}}", "--resolver", "flytekit.core.python_auto_container.default_task_resolver", "--", "task-module", "core.flyte_basics.hello_world", "task-name", "say_hello" ],
    "image": "ghcr.io/flyteorg/flytecookbook:core"
  }
}

@task
def square(n: int) -> int:
    return n * n

$ pyflyte --pkgs myapp.workflows package --image ghcr.io/flyteorg/flytecookbook:core

28 of 62

Why Registration?

  • Observable graphs (before execution)
  • Automated tracking
  • Full reproducibility — code snapshot captured, connected with a version (usually git-SHA)
  • Ability to have multiple active versions at the same time
  • Backtrack from predictions to actual code
  • Flyte's strongly typed system makes it possible to auto-generate a launch form, verify inputs, and ensure correct execution behavior

29 of 62

Before Serialize

After Serialize

@task
def say_hello() -> str:
    return "hello world"

@workflow
def my_wf() -> str:
    res = say_hello()
    return res

"id": { "resourceType": "WORKFLOW", "project": "flytesnacks", "domain": "development", "name": "core.flyte_basics.hello_world.my_wf", "version": "v0.2.247" },

"tasks": [{"template": { TaskSpec 1 }}]

"closure": { "compiledWorkflow": { "primary": { "template": {

}

}

….

$ pyflyte --pkgs myapp.workflows package --image ghcr.io/flyteorg/flytecookbook:core

$ docker build -t ghcr.io/flyteorg/flytecookbook:core .

$ docker push ghcr.io/flyteorg/flytecookbook:core

30 of 62

Serialize Package

GRPC / REST

"id": { "resourceType": "WORKFLOW", "project": "flytesnacks", "domain": "development", "name": "core.flyte_basics.hello_world.my_wf", "version": "v0.2.247" },

"tasks": [{"template": { TaskSpec 1 }}]

"closure": { "compiledWorkflow": { "primary": { "template": {

}

}

$ flytectl register files -p flytesnacks -d development --version v1.0.0 flyte-packages.tar.gz --archive

31 of 62

Process of Registration

32 of 62

Before Serialize

After Fast Serialize

@task
def say_hello() -> str:
    return "hello flyte"

@workflow
def my_wf() -> str:
    res = say_hello()
    return res

"id": { "resourceType": "WORKFLOW", "project": "flytesnacks", "domain": "development", "name": "core.flyte_basics.hello_world.my_wf", "version": "v0.2.247" },

"tasks": [{"template": { TaskSpec 1 }}]

"closure": { "compiledWorkflow": { "primary": { "template": {

}

}

….

$ pyflyte --pkgs myapp.workflows package --fast --image ghcr.io/flyteorg/flytecookbook:core

Code artifact

33 of 62

Fast Serialize Package

GRPC

Upload

"id": { "resourceType": "WORKFLOW", "project": "flytesnacks", "domain": "development", "name": "core.flyte_basics.hello_world.my_wf", "version": "v0.2.247" },

"tasks": [{"template": { TaskSpec… }}]

"closure": { "compiledWorkflow": { "primary": { "template": {

}

}

$ flytectl register files -p flytesnacks -d development --version v1.0.0 flyte-packages.tar.gz --archive

Blob store

Code artifact

34 of 62

Fast Registration

Code artifact

Re-use container

Blob store

35 of 62

Swiss Army Knife: Sandbox Environment

$ flytectl demo start --version v0.19.1

36 of 62

MLOps meets DevOps

37 of 62

Domains + Registration: DevOps Power!

Step 1 of 3

  • At Lyft, every time a new PR is created:
    • A Docker container is built
    • The code is registered with Flyte
  • If users modify their dependencies, the PR is first required to build containers (securely)
  • For code-only changes, users can use fast registration to snapshot the code from laptops or dev environments
  • Everything is automatically tracked!
  • Multiple users can create isolated PRs and test independently - at scale!
  • This flow is now followed at GoJek, Wolt, Spotify, Toyota, etc.

38 of 62

Domains + Registration: DevOps Power!

Step 2 of 3

  • Once the code is ready and passes code review, merge to master
  • Users then drive a deployment through a deployment pipeline (see the sketch below)
  • At each stage of the deployment, workflows and tasks are registered with a specific domain in Flyte
  • Each domain may change the data directory, the associated roles, or meta attributes like labels and annotations
  • At Lyft, only the production domain allows schedules
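A hedged sketch of what one deployment stage might run, reusing the packaging and registration commands shown earlier with only the domain and version switched per stage (the project name, file names, and $GIT_SHA variable are placeholders):

# Build and package once per merge, then register the same package into each domain in turn.
$ pyflyte --pkgs myapp.workflows package --image ghcr.io/example/myapp:$GIT_SHA
$ flytectl register files -p myproject -d staging --version $GIT_SHA flyte-packages.tar.gz --archive
# ...after staging validation passes...
$ flytectl register files -p myproject -d production --version $GIT_SHA flyte-packages.tar.gz --archive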

39 of 62

Domains + Registration: DevOps Power!

Step 3 of 3

  • For production deployments, logs and metrics are automatically tracked.
  • Models are automatically promoted to serving infrastructure (multiple options depending on the team)
  • Users can use interactive notebooks to retrieve intermediate or final outputs, analyze data and also automate various monitoring tasks

40 of 62

Extensibility & Flexibility

41 of 62

Why?

Extensibility & Flexibility

  • It is impossible to build an orchestration platform that fits 100% of use cases
  • As we built Flyte and saw it used by many different organizations, we realized that users want to:
    • Extend
    • Decorate
    • Contribute
  • This led to Flyte's highly extensible design:
    • Flyte is extensible in every component - the programming SDK, the backend, the UI, etc.
  • Flyte also brings the convenience of a service-oriented architecture to data orchestration - no more dealing with coupled client code and infrastructure.

42 of 62

Use Cases for Extending Flyte

Extensibility & Flexibility

Use case -> recommended extension point:

  • Python? Custom extensions, try before investing in a backend plugin -> Python flytekit plugin
  • A library with user-defined extensions -> User container plugin
  • Provide a prebuilt container, plug & play -> Prebuilt container plugin
  • Custom, specialized, domain-specific types -> Flytekit Type Transformer
  • Golang? Multi-language support? A high-performance plugin -> Backend plugin (K8s plugin, WebAPI plugin, "fancy" plugin)
  • Write new experiences for your users -> Meta DSL on flytekit
  • A new language SDK -> Flyte Service API
  • Customized interfaces, no-code solutions, etc. -> Contributions :D

43 of 62

Extensibility & Flexibility

FlyteKit Type Transformers

Allows you to create domain specific types and let Flyte understand them. They can be configured to be auto-loaded.

FlyteKit-Only Task Plugins

Allows syntactic sugar to be provided like a library. Flyte executes a python container, so you can potentially do anything in Flytekit itself.

FlyteKit Data Persistence Plugins

Allows you to persist data to various stores by automatically using URIs, e.g., s3://, gcs://, bq://...

$ pip install flytekitplugins-*

$ pip install flytekitplugins-data-*
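A minimal, hedged sketch of the Type Transformer extension point described above: the ZipCode type is hypothetical, and the exact module paths below can shift between flytekit versions.

import typing

from flytekit.core.context_manager import FlyteContext
from flytekit.extend import TypeEngine, TypeTransformer
from flytekit.models.literals import Literal, Primitive, Scalar
from flytekit.models.types import LiteralType, SimpleType


class ZipCode:
    """A hypothetical domain-specific type."""
    def __init__(self, code: str):
        self.code = code


class ZipCodeTransformer(TypeTransformer[ZipCode]):
    """Teaches Flyte to move ZipCode values across task boundaries as strings."""

    def __init__(self):
        super().__init__(name="zipcode-transformer", t=ZipCode)

    def get_literal_type(self, t: typing.Type[ZipCode]) -> LiteralType:
        return LiteralType(simple=SimpleType.STRING)

    def to_literal(self, ctx: FlyteContext, python_val: ZipCode,
                   python_type: typing.Type[ZipCode], expected: LiteralType) -> Literal:
        return Literal(scalar=Scalar(primitive=Primitive(string_value=python_val.code)))

    def to_python_value(self, ctx: FlyteContext, lv: Literal,
                        expected_python_type: typing.Type[ZipCode]) -> ZipCode:
        return ZipCode(lv.scalar.primitive.string_value)


# Register once (e.g., at import time); tasks can then use ZipCode in their signatures.
TypeEngine.register(ZipCodeTransformer())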

44 of 62

Extensibility & Flexibility

DSLs in Other Languages

Use the core protobuf definitions to write an SDK in any new language!

Java and Scala SDKs are already available (incubating), contributed by Spotify.

case class GreetTaskInput(name: String)
case class GreetTaskOutput(greeting: String)

class GreetTask
    extends SdkRunnableTask(
      SdkScalaType[GreetTaskInput],
      SdkScalaType[GreetTaskOutput]
    ) {
  override def run(input: GreetTaskInput): GreetTaskOutput =
    GreetTaskOutput(s"Welcome, ${input.name}!")
}

object GreetTask {
  def apply(name: SdkBindingData): SdkTransform =
    new GreetTask().withInput("name", name)
}

45 of 62

Extensibility & Flexibility

Backend Plugins

True power of Flyte!

Powerful, stateful plugins: starting multiple containers for a cluster, calling external APIs, performing complex auth flows, etc.

Unified API across languages

Maintain easily: patch-fix without deploying code fixes to users

Migrate seamlessly

@task(
    task_config=MPIJob(
        num_workers=2,
        num_launcher_replicas=1,
        slots=1,
    ),
    retries=3,
    cache=True,
    cache_version="0.1",
    requests=Resources(cpu="1", mem="300Mi"),
    limits=Resources(cpu="2"),
)
def horovod_train_task(batch_size: int, buffer_size: int, dataset_size: int) -> FlyteDirectory:
    hvd.init()
    ...

46 of 62

But wait, what about platform folks?

47 of 62

Architecture overview

48 of 62

Flyte Component Layer Cake

49 of 62

Built by Platform Engineers for Platform Teams!

Serverless

Provide a serverless environment for your users (central service)

Platform Builders

Extend & customize everything.

Incremental and recoverable

Best in class support for memoization and complete recovery from any transient failures

Low footprint

Backend written in performant Golang

OAuth2 & SSO

OAuth2 and SSO support is available natively in open source

Observe and Audit

Published monitoring dashboard templates, extensive documentation

Isolated & Secure Execution

Execute with separate Permissions, administer and manage quotas, queues etc

gRPC

Fully documented Service specification

50 of 62

Features for the Platform Folks

  • Platform failures?
    • Recover instantly: try the Recover button
  • Manage clusters transparently
    • Add/remove/drain clusters; isolate teams to specific k8s clusters
  • Full SSO out of the box
  • Add platform-wide defaults
  • Set per-project/domain resource limits
  • Use spot machines: the platform automatically guarantees execution, with support for intra-task checkpointing (sketched below)
  • Continuous deployment: no single point of failure or isolation
  • Set compute limits
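For the spot-machine bullet above, a hedged sketch of intra-task checkpointing as exposed through flytekit's task context (the read/write byte interface mirrors the public checkpointing examples; exact behavior may vary by version):

from flytekit import current_context, task

@task(retries=3, interruptible=True)
def train_with_checkpoints(n_epochs: int) -> int:
    cp = current_context().checkpoint
    start = 0
    prev = cp.read()  # bytes saved by a previous (preempted) attempt, if any
    if prev is not None:
        start = int(prev.decode())
    for epoch in range(start, n_epochs):
        # ... one epoch of work ...
        cp.write(str(epoch + 1).encode())  # persist progress for the next attempt
    return n_epochs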

51 of 62

Sneak Peek…

52 of 62

UnionML

The easiest way to build and deploy machine learning microservices

UnionML bridges the gap between research and production by providing a unified interface for datasets, model training, and batch, online, and streaming prediction.

53 of 62

Dataset & Model

54 of 62

Dataset & Model

55 of 62

Serve Seamlessly

56 of 62

57 of 62

Flyte Community

58 of 62

Flyte Community

Flyte Overview

  • Communication
    • Flyte Slack (750+)

https://slack.flyte.org

    • Flyte GitHub Discussions

https://github.com/flyteorg/flyte/discussions

  • Announcements
    • Slack
    • Twitter

https://twitter.com/flyteorg

    • Google Group

https://groups.google.com/u/0/a/flyte.org/g/users

  • Publications
    • Flyte Blog

https://blog.flyte.org/

    • Flyte Monthly

https://www.getrevue.co/profile/flyte

  • Sync Up
    • Biweekly OSS Sync Up

https://www.addevent.com/calendar/kE355955

    • YouTube

https://bit.ly/3jL6xOX

59 of 62

Collaborators, Contributors & Integrations

Flyte Overview

  • Lyft
  • Spotify
  • Striveworks
  • USU
  • Freenome
  • WovenPlanet
  • Intel
  • Gojek
  • Union.ai
  • Blackshark
  • Wolt (Doordash)
  • RunX
  • MethaneSAT
  • And many others

60 of 62

The Open Source Community

Flyte team understands well that community is a key factor in building a successful open source project. The team gave me and many others first class onboarding experience which created trust to build a closer collaboration with them.

-- Pradithya Aria Pura, Tech Lead - ML Platform, GoJek

It is a privilege to be part of an open-source community as vibrant and committed as the one Ketan and team have built around Flyte. The diversity of problems the community is tackling, along with the commitment and engagement from the core team is something special, and the team here at Striveworks is excited to be a part of it.

-- Jim Rebesco, co-founder, Striveworks

Flyte is quickly becoming a workhorse at Freenome. The regular cadence of high quality backward-compatible releases, quick turnarounds on bug fixes, realtime support on Slack, and a vibrant community of helpful power users across many industries have really supercharged our adoption.

-- Jeev Balakrishnan, Tech Lead ML Platform, Freenome

Contributing to flyte.org projects was an incredibly streamlined, supportive, and collaborative experience. Many open-source projects are difficult to work with due to lack of communication, long delays, and inflexibility about making larger changes and supporting new use cases. Haytham and Ketan were both extremely helpful as far as helping me workshop my original idea, giving quick feedback and adding me to the Github organization so I could run CI builds. They really encourage new contributors and new ideas while maintaining a product vision and high code quality.

-- Claire Mcginty, Engineer, Spotify

61 of 62

Recap

Roadmap

Reproducibility

  • Capture code snapshots - containers
  • Workflows as a means of auto-checkpointing
  • Flyte constructs data sandboxes for executions
  • Memoize executions and recover
  • Versioned workflows and tasks

Observability

  • Workflow graphs before they execute
  • Execution timelines & logs - debug
  • Intermediate inputs & outputs
  • Documentation for code
  • Failures shown inline
  • Historical executions - tied to versions
  • FlyteDecks

62 of 62

Few more items

Roadmap

  • Work in progress
    • Actor support to allow reusing existing clusters like Ray, Dask, Spark, etc.
    • This will enable better hyperparameter search algorithms, especially with iterative pruning
  • What will help you
    • Do not think about clusters; think about the problem
    • Multi-tenant environment
    • Support for caching at task boundaries
    • Intra-task checkpointing and use of spot machines to allow resilient workflows