Savin Goyal, CTO at Outerbounds
Yuan Tang, Founding Engineer at Akuity & Argo Project Lead
Beyond Prototypes: Production-Ready ML Systems with Metaflow and Argo
Beyond Prototypes
Building Production-Ready ML Systems
Build a Cathedral?
Results can be gorgeous but are fundamentally static and hard to change.
Also, you can’t afford to build too many of these.
Set up a Bazaar?
The results can be exciting but limited in impact due to lack of coordinated, goal-oriented investment.
Also, results are hard to maintain in the long term.
Cathedral-style
Bazaar-style
How to engineer systems powered by ML?
Too constrained
Too expensive
Lack of experimentation
Results too static
Too chaotic
Unpredictable results
Little cumulative progress
Limited long-term value
Farm-style!
Plan ahead, react to results
Grow investment as needed
Iterate on the best ideas
Produces long-term value
Establish an ML farm!
Common tooling & infrastructure saves time & effort, makes it easier to collaborate
Valuable
System
Engineer
runs reliably without human supervision
produces correct results
Traditional Software
Valuable
System
Engineer
Data Scientist
ML Model
Data
✨ New✨
✨ New✨
✨ New✨
ML-powered Software
Valuable
System
Engineer
Data Scientist
ML Model
Data
✨ New✨
✨ New✨
✨ New✨
produces correct results
runs reliably without human supervision
ML-powered Software
How much infrastructure is needed
How much the data scientist cares
class MyFlow(FlowSpec):

    @step
    def start(self):
        import pandas as pd
        self.df = pd.DataFrame(big_one)
        self.next(self.end)

    @step
    def end(self):
        self.model = train(self.df)
Workflows
Define workflows with a human-friendly syntax
class QueryFlow(FlowSpec):

    @step
    def query(self):
        self.ctas = "CREATE TABLE %s AS %s" % (self.table, self.sql)
        query = wr.athena.start_query_execution(self.ctas)
        output = wr.athena.wait_query(query)
        loc = output['ResultConfiguration']['OutputLocation']
        with metaflow.S3() as s3:
            results = [obj.url for obj in s3.list_recursive([loc])]
Data
Comes with tools for fast-data access
class MyFlow(FlowSpec):

    @step
    def start(self):
        self.alpha = 0.5
        self.next(self.train)

    @step
    def train(self):
        self.model = train_model(self.alpha)
Versioning
Everything gets versioned automatically
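Automatic versioning works because every `self.` attribute is snapshotted per run. A toy, stdlib-only sketch of the underlying idea (content-addressed artifact storage; this is an illustration, not Metaflow's actual implementation — `ArtifactStore` and its methods are hypothetical names):

```python
import hashlib
import pickle

# Toy artifact store: every value is pickled and stored under the hash of
# its bytes, so each run's state is immutable and individually addressable --
# the core idea behind automatic versioning. Not Metaflow's real internals.
class ArtifactStore:
    def __init__(self):
        self._blobs = {}   # sha256 -> pickled bytes
        self._runs = {}    # run_id -> {artifact name -> sha256}

    def save(self, run_id, name, value):
        blob = pickle.dumps(value)
        key = hashlib.sha256(blob).hexdigest()
        self._blobs[key] = blob                      # de-duplicated across runs
        self._runs.setdefault(run_id, {})[name] = key

    def load(self, run_id, name):
        return pickle.loads(self._blobs[self._runs[run_id][name]])

store = ArtifactStore()
store.save('run_1', 'alpha', 0.5)
store.save('run_2', 'alpha', 0.7)   # a new run never overwrites an old one
```

Because storage is keyed by content hash and run id, old experiments stay reproducible even as new runs accumulate.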
Tracking & namespaces
Track everything by default
@card
@project(name='LTV')
class TrainingFlow(FlowSpec):

    @pypi(packages={'..'})
    @step
    def start(self):
        self.model = train(...)
        self.customer_id = id
Create isolated namespaces for experiments and production
All executions of flows and tasks are tracked automatically
Package and persist user code and its dependencies
All modeling libraries supported, format-agnostic
Track all state, not just results
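The points above can be sketched in plain Python: a tracker that records every execution under a namespace, so production runs and per-user experiments never collide. This is a toy illustration of the concept with hypothetical names, not Metaflow's Client API:

```python
# Toy run tracker: every execution is recorded with its namespace, so
# 'production' and per-user experiment runs are isolated from each other.
class RunTracker:
    def __init__(self):
        self._runs = []

    def record(self, namespace, flow, run_id, state):
        # Track all state, not just the final result.
        self._runs.append({'namespace': namespace, 'flow': flow,
                           'run_id': run_id, 'state': dict(state)})

    def latest(self, namespace, flow):
        runs = [r for r in self._runs
                if r['namespace'] == namespace and r['flow'] == flow]
        return runs[-1] if runs else None

tracker = RunTracker()
tracker.record('user.alice', 'TrainingFlow', '1', {'alpha': 0.5})
tracker.record('production', 'TrainingFlow', '2', {'alpha': 0.9})
```

Queries scoped to a namespace see only that namespace's runs, which is what makes experiments safe to run alongside production.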
@step
def start(self):
    self.params = list(range(100))
    self.next(self.train, foreach='params')

@resources(memory=128000)
@step
def train(self):
    self.model = train(...)
    self.next(self.join)

@step
def join(self, inputs):
    ...
Compute
Run experiment at scale in the cloud
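The `foreach` fan-out above launches one task per parameter and then joins the results. A stdlib-only sketch of those semantics using a thread pool in place of cloud tasks (the `train` function is a stand-in, not real training code):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a real training routine (assumption: any pure function works).
def train(alpha):
    return {'alpha': alpha, 'score': 1.0 - alpha * 0.01}

# Mimic foreach/join: one task per parameter, executed concurrently,
# followed by a join step that sees all 100 results.
params = list(range(100))
with ThreadPoolExecutor(max_workers=8) as pool:
    models = list(pool.map(train, params))

best = max(models, key=lambda m: m['score'])
```

In Metaflow the same pattern scales out to cloud compute because each `train` branch is an independent task with its own `@resources`.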
ML orchestrator
Security orchestrator
CI/CD orchestrator
Orchestration
So many flows!
ML workflows
Security workflows
CI/CD workflows
Orchestration
ML is not an island
Centralized common orchestrator
Defined with
CI/CD tooling
Defined with
ML-optimized
tooling
Defined with
security-optimized
tooling
Orchestration
Common infrastructure FTW!
Argo Workflows as a centralized orchestrator
CI/CD and Automation
Argo Workflows
ML and data science
Metaflow
Orchestration
Use what you already know!
Orchestration
Single-click scheduling (and back!)
Orchestration
Best of both worlds!
Orchestration
React to outside world!
@trigger(event='new_data')
class TrainingFlow(FlowSpec):

    date = Parameter('date')
    ...

@trigger_on_finish(flow='TrainingFlow')
class InferenceFlow(FlowSpec):

    @step
    def start(self):
        model = current.trigger.run.data.model
        ...
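The chaining that the decorators express declaratively — an external event runs training, and training's completion runs inference — can be sketched as a toy event bus. All names here are hypothetical; this only illustrates the event-driven idea, not Metaflow's or Argo Events' machinery:

```python
# Toy event bus: publishing 'new_data' runs the training flow; finishing
# training publishes an event that runs the inference flow.
class EventBus:
    def __init__(self):
        self._handlers = {}   # event name -> list of callables

    def on(self, event, handler):
        self._handlers.setdefault(event, []).append(handler)

    def publish(self, event, payload=None):
        for handler in self._handlers.get(event, []):
            handler(payload)

bus = EventBus()
log = []

def training_flow(payload):
    model = {'trained_on': payload}
    log.append('training')
    bus.publish('TrainingFlow.finished', model)   # downstream trigger

def inference_flow(model):
    log.append('inference')

bus.on('new_data', training_flow)
bus.on('TrainingFlow.finished', inference_flow)
bus.publish('new_data', '2024-01-01')
```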
A set of Kubernetes-native tools for deploying and running applications, managing clusters, and doing GitOps right.
Argo Project
Argo Project
200+ end user companies, 14k+ Slack members, 25k+ GitHub stars, 6k+ forks
Argo Project
The container-native workflow engine for Kubernetes
Argo Workflows
CRDs and Controllers
Interfaces
Argo Workflows
Argo Events
ML Workflow with Argo
Data ingestion
Model training
Cache store (Argo/K8s/etc.)
GitHub events (commits/PRs/tags/etc.)
The data has NOT been updated recently.
The data has already been updated recently.
Argo Events receives the events and then triggers an ML pipeline with Argo Workflows
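The cache-store check in the diagram — skip ingestion when the data was already refreshed recently — can be sketched as a small gate function. The names, TTL, and cache shape are illustrative assumptions, not the actual pipeline logic:

```python
import time

# Toy gate in front of data ingestion: skip the expensive step when the
# cache says the data was already refreshed recently (assumed TTL: 1 hour).
CACHE_TTL_SECONDS = 3600

def should_ingest(cache, key, now=None):
    """Return True only if the dataset has NOT been updated recently."""
    now = time.time() if now is None else now
    last_updated = cache.get(key)
    return last_updated is None or now - last_updated > CACHE_TTL_SECONDS

cache = {'dataset': 9_500}
recent = should_ingest(cache, 'dataset', now=10_000)   # 500 s ago -> skip
stale = should_ingest(cache, 'dataset', now=20_000)    # >1 h ago -> ingest
```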
Simplified ML Workflow with Metaflow + Argo