1 of 40

2 of 40

Savin Goyal, CTO at Outerbounds

Yuan Tang, Founding Engineer at Akuity & Argo Project Lead

Beyond Prototypes: Production-Ready ML Systems with Metaflow and Argo

3 of 40

4 of 40

Beyond Prototypes: Building Production-Ready ML Systems

5 of 40

Build a Cathedral?

  • Decide and plan a specific deliverable
  • Allocate a large amount of time and resources
  • Prepare to spend a long time building it

Results can be gorgeous but are fundamentally static and hard to change.

Also, you can't afford to build too many of these.

6 of 40

Set up a Bazaar?

  • Invite people to share their best ideas and work together as a loose collective.
  • Low initial investment, easy to get started.
  • Quick to experiment with ideas, react to changing demands.

The results can be exciting but limited in impact due to the lack of coordinated, goal-oriented investment.

Also, the results are hard to maintain in the long term.

7 of 40

How to engineer systems powered by ML?

Cathedral-style:

  • Too constrained
  • Too expensive
  • Lack of experimentation
  • Results too static

Bazaar-style:

  • Too chaotic
  • Unpredictable results
  • Little cumulative progress
  • Limited long-term value

Farm-style!

  • Plan ahead, react to results
  • Grow investment as needed
  • Iterate on the best ideas
  • Produces long-term value

8 of 40

Establish an ML farm!

  • Culture of experimentation
    • Make it easy to test diverse ideas
    • Nurture them enough that you start seeing actual value (or not)
    • Double down on ideas that work
  • The best ML projects never end
    • Keep fixing, iterating, and improving
    • Deliver value year after year

Common tooling & infrastructure saves time & effort and makes it easier to collaborate

9 of 40

Traditional Software

(Diagram: an engineer builds a valuable system that runs reliably without human supervision and produces correct results.)

10 of 40

ML-powered Software

(Diagram: an engineer and a data scientist build a valuable system together; new data continuously yields new ML models, which feed the system.)

11 of 40

ML-powered Software

(Diagram: the same ML-powered system, with new data and new ML models continuously arriving, must still run reliably without human supervision and produce correct results.)

12 of 40

13 of 40

14 of 40

15 of 40

16 of 40

17 of 40

18 of 40

19 of 40

(Chart: the layers of the ML infrastructure stack, contrasting how much infrastructure is needed at each layer with how much the data scientist cares about it.)

20 of 40

21 of 40

class MyFlow(FlowSpec):

    @step
    def start(self):
        import pandas as pd
        self.df = pd.DataFrame(big_one)
        self.next(self.end)

    @step
    def end(self):
        self.model = train(self.df)

Workflows

Define workflows with a human-friendly syntax

22 of 40

class QueryFlow(FlowSpec):

    @step
    def query(self):
        self.ctas = "CREATE TABLE %s AS %s" % (self.table, self.sql)
        query = wr.athena.start_query_execution(self.ctas)
        output = wr.athena.wait_query(query)
        loc = output['ResultConfiguration']['OutputLocation']
        with metaflow.S3() as s3:
            results = [obj.url for obj in s3.list_recursive([loc])]

Data

Comes with tools for fast-data access

23 of 40

class MyFlow(FlowSpec):

    @step
    def start(self):
        self.alpha = 0.5
        self.next(self.train)

    @step
    def train(self):
        self.model = train_model(self.alpha)

Versioning

Everything gets versioned automatically
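Metaflow's actual datastore is content-addressed and far more capable; as a rough sketch of the idea, each run's artifacts can be snapshotted and deduplicated by hash (ArtifactStore here is a hypothetical toy, not Metaflow code):

```python
import hashlib
import pickle


class ArtifactStore:
    """Toy content-addressed store: each artifact version is keyed by
    the hash of its pickled value. A sketch of the idea behind automatic
    versioning, not Metaflow's real implementation."""

    def __init__(self):
        self._blobs = {}   # sha -> pickled bytes
        self._runs = []    # per-run snapshot: {artifact name -> sha}

    def snapshot(self, artifacts):
        """Persist one run's artifacts; identical values are stored once."""
        index = {}
        for name, value in artifacts.items():
            blob = pickle.dumps(value)
            sha = hashlib.sha256(blob).hexdigest()
            self._blobs[sha] = blob
            index[name] = sha
        self._runs.append(index)
        return len(self._runs) - 1  # run id

    def load(self, run_id, name):
        return pickle.loads(self._blobs[self._runs[run_id][name]])


store = ArtifactStore()
run0 = store.snapshot({"alpha": 0.5, "model": [1, 2, 3]})
run1 = store.snapshot({"alpha": 0.7, "model": [1, 2, 3]})  # model unchanged
```

Because the unchanged model pickles to the same bytes, both runs share one stored blob for it while each run keeps its own `alpha`.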

24 of 40

Tracking & namespaces

Track everything by default

@card
@project(name='LTV')
class TrainingFlow(FlowSpec):

    @pypi(packages={'..'})
    @step
    def start(self):
        self.model = train(...)
        self.customer_id = id

Create isolated namespaces for experiments and production

All executions of flows and tasks are tracked automatically

Package and persist user code and its dependencies

All modeling libraries supported, format-agnostic

Track all state, not just results

25 of 40

@step
def start(self):
    self.params = list(range(100))
    self.next(self.train, foreach='params')

@resources(memory=128000)
@step
def train(self):
    self.model = train(...)
    self.next(self.join)

@step
def join(self, inputs):
    ...

Compute

Run experiments at scale in the cloud
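The foreach above fans out one training task per parameter and joins the results. Locally, the same fan-out/fan-in shape can be sketched with stdlib concurrency (train_model and its scoring are illustrative stand-ins, not real training):

```python
from concurrent.futures import ThreadPoolExecutor


def train_model(alpha):
    # Stand-in for real training: "score" is just a function of alpha.
    return {"alpha": alpha, "score": 1.0 - abs(alpha - 0.3)}


def fan_out_fan_in(params):
    # Fan out: one task per parameter (like Metaflow's foreach split).
    # Fan in: the join picks the best result from all branches.
    with ThreadPoolExecutor(max_workers=8) as pool:
        models = list(pool.map(train_model, params))
    return max(models, key=lambda m: m["score"])


best = fan_out_fan_in([i / 10 for i in range(10)])
```

In Metaflow the same shape runs as independent cloud tasks, each with its own `@resources` allocation, rather than local threads.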

26 of 40

(Diagram: a separate orchestrator per domain: an ML orchestrator, a security orchestrator, and a CI/CD orchestrator.)

Orchestration

So many flows!

27 of 40

(Diagram: ML workflows, security workflows, and CI/CD workflows side by side.)

Orchestration

ML is not an island

28 of 40

(Diagram: a centralized common orchestrator runs all workflows: some defined with CI/CD tooling, some with ML-optimized tooling, and some with security-optimized tooling.)

Orchestration

Common infrastructure FTW!

29 of 40

Argo Workflows as a centralized orchestrator

(Diagram: CI/CD and automation workflows are defined directly with Argo Workflows; ML and data science workflows are defined with Metaflow.)

Orchestration

Use what you already know!

30 of 40

Orchestration

Single-click scheduling (and back!)

31 of 40

Orchestration

Best of both worlds!

32 of 40

Orchestration

React to outside world!

@trigger(event='new_data')
class TrainingFlow(FlowSpec):

    date = Parameter('date')
    ...

@trigger(flow='TrainingFlow')
class InferenceFlow(FlowSpec):

    @step
    def start(self):
        model = trigger.model
        ...
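Under the hood, `@trigger` amounts to event-driven chaining: an external event starts one flow, and that flow's completion starts the next. A toy sketch of the pattern (EventBus and the flow functions are hypothetical stand-ins, not Metaflow or Argo Events internals):

```python
class EventBus:
    """Minimal pub/sub sketch of what @trigger wires up: a finished
    flow publishes an event, and subscribed flows start in response."""

    def __init__(self):
        self._subs = {}

    def subscribe(self, event, handler):
        self._subs.setdefault(event, []).append(handler)

    def publish(self, event, payload):
        for handler in self._subs.get(event, []):
            handler(payload)


log = []
bus = EventBus()


def training_flow(payload):
    model = "model-trained-on-" + payload["date"]
    log.append(("TrainingFlow", model))
    # Like @trigger(flow='TrainingFlow'): completion emits its own event.
    bus.publish("TrainingFlow.done", {"model": model})


def inference_flow(payload):
    log.append(("InferenceFlow", payload["model"]))


bus.subscribe("new_data", training_flow)            # @trigger(event='new_data')
bus.subscribe("TrainingFlow.done", inference_flow)  # @trigger(flow='TrainingFlow')
bus.publish("new_data", {"date": "2024-01-01"})
```

In production this bus is Argo Events, and the handlers are Argo Workflows deployed from Metaflow.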

33 of 40

A set of Kubernetes-native tools for deploying and running applications, managing clusters, and doing GitOps right.

  • Argo Workflows: Kubernetes-native workflow engine.
  • Argo Events: Event-based dependency management for Kubernetes.
  • Argo CD: Declarative continuous delivery with a fully-loaded UI.
  • Argo Rollouts: Advanced K8s progressive deployment strategies.

Argo Project

34 of 40

Argo Project

200+ end user companies, 14k+ Slack members, 25k+ GitHub stars, 6k+ forks

35 of 40

Argo Project

  • 800+ contributors
  • Mentoring programs and regular contributor meetings
  • 43 core maintainers from over 10 organizations
  • #3 among CNCF projects and #4 among LF projects by development velocity

36 of 40

The container-native workflow engine for Kubernetes

  • Machine learning pipelines
  • Data processing/ETL
  • Infrastructure automation
  • Continuous delivery/integration

Argo Workflows
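For reference, a minimal hello-world Workflow manifest might look like this sketch (the image and names are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-   # controller appends a random suffix
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [echo, "hello from Argo Workflows"]
```

Submitted, for example, with `argo submit --watch hello-world.yaml`; the controller then runs the container as a pod and tracks its status on the Workflow resource.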

37 of 40

CRDs and Controllers

  • Kubernetes custom resources that natively integrate with other K8s resources (volumes, secrets, etc.)

Interfaces

  • CLI: manage workflows and perform operations (submit, suspend, delete, etc.)
  • Server: REST & gRPC interfaces
  • SDKs: Python, Go, and Java SDKs
  • UI: manage and visualize workflows, artifacts, logs, resource usage analytics, etc.

Argo Workflows

38 of 40

  • Supports events from 20+ event sources
    • Webhooks, S3, GCP PubSub, Git, Slack, etc.
  • Supports 10+ triggers
    • Kubernetes Objects, Argo Workflow, AWS Lambda, Kafka, Slack, etc.
  • Manages everything from simple, linear, real-time events to complex, multi-source event dependencies
  • CloudEvents specification compliant

Argo Events
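The CloudEvents 1.0 spec requires four context attributes on every event: id, source, specversion, and type. A small stdlib sketch of validating such an envelope (the event contents below are illustrative, not a real Argo Events payload):

```python
# The four context attributes the CloudEvents 1.0 spec makes REQUIRED.
REQUIRED = ("id", "source", "specversion", "type")


def is_valid_cloudevent(event):
    """Return True if the dict carries all required CloudEvents
    context attributes as non-empty strings."""
    return all(
        isinstance(event.get(attr), str) and event[attr] for attr in REQUIRED
    )


event = {
    "specversion": "1.0",
    "type": "com.example.repo.push",              # illustrative event type
    "source": "https://github.com/example/repo",  # illustrative source
    "id": "a1b2c3",
    "data": {"ref": "refs/heads/main"},           # optional payload
}
```

Conforming to this shape is what lets Argo Events interoperate with the many event sources and triggers listed above.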

39 of 40

ML Workflow with Argo

(Diagram: GitHub events (commits/PRs/tags/etc.) arrive at Argo Events, which triggers an ML pipeline of data ingestion followed by model training in Argo Workflows. A cache store (Argo/K8s/etc.) records whether the data has already been updated recently: if it has NOT, data ingestion runs first; if it has, the pipeline proceeds straight to model training.)

Argo Events receives the events and then triggers an ML pipeline with Argo Workflows.

40 of 40

Simplified ML Workflow with Metaflow + Argo

(Diagram: the same pipeline as the previous slide: GitHub events (commits/PRs/tags/etc.) arrive at Argo Events, which triggers the Metaflow-defined pipeline on Argo Workflows; the cache store (Argo/K8s/etc.) determines whether data ingestion runs or the pipeline skips straight to model training.)

Argo Events receives the events and then triggers an ML pipeline with Argo Workflows.