Savin Goyal, CTO at Outerbounds
Yuan Tang, Founding Engineer at Akuity & Argo Project Lead
Beyond Prototypes: Production-Ready ML Systems with Metaflow and Argo
Beyond Prototypes
Building Production-Ready ML Systems
Build a Cathedral?
Results can be gorgeous but are fundamentally static and hard to change.
Also, you can’t afford to build too many of these.
Set up a Bazaar?
The results can be exciting but limited in impact due to lack of coordinated, goal-oriented investment.
Also, results are hard to maintain in the long term.
Cathedral-style
Bazaar-style
How to engineer systems powered by ML?
Too constrained
Too expensive
Lack of experimentation
Results too static
Too chaotic
Unpredictable results
Little cumulative progress
Limited long-term value
Farm-style!
Plan ahead, react to results
Grow investment as needed
Iterate on the best ideas
Produces long-term value
Establish an ML farm!
Common tooling & infrastructure saves time & effort, makes it easier to collaborate
Valuable
System
Engineer
runs reliably without human supervision
produces correct results
Traditional Software
Valuable
System
Engineer
Data Scientist
ML Model
Data
✨ New✨
✨ New✨
✨ New✨
ML-powered Software
Valuable
System
Engineer
Data Scientist
ML Model
Data
✨ New✨
✨ New✨
✨ New✨
produces correct results
runs reliably without human supervision
ML-powered Software
How much infrastructure is needed
How much the data scientist cares
class MyFlow(FlowSpec):

    @step
    def start(self):
        import pandas as pd
        self.df = pd.DataFrame(big_one)
        self.next(self.end)

    @step
    def end(self):
        self.model = train(self.df)
Workflows
Define workflows with a human-friendly syntax
class QueryFlow(FlowSpec):

    @step
    def query(self):
        self.ctas = "CREATE TABLE %s AS %s" % (self.table, self.sql)
        query = wr.athena.start_query_execution(self.ctas)
        output = wr.athena.wait_query(query)
        loc = output['ResultConfiguration']['OutputLocation']
        with metaflow.S3() as s3:
            results = [obj.url for obj in s3.list_recursive([loc])]
Data
Comes with tools for fast-data access
class MyFlow(FlowSpec):

    @step
    def start(self):
        self.alpha = 0.5
        self.next(self.train)

    @step
    def train(self):
        self.model = train_model(self.alpha)
Versioning
Everything gets versioned automatically
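Automatic versioning works because every `self.` attribute is snapshotted per run. A toy, stdlib-only sketch of the underlying idea (content-addressed artifact storage; this is an illustration, not Metaflow's actual implementation — `ArtifactStore` and its methods are hypothetical names):

```python
import hashlib
import pickle

# Toy artifact store: every value is pickled and stored under the hash of
# its bytes, so each run's state is immutable and individually addressable --
# the core idea behind automatic versioning. Not Metaflow's real internals.
class ArtifactStore:
    def __init__(self):
        self._blobs = {}   # sha256 -> pickled bytes
        self._runs = {}    # run_id -> {artifact name -> sha256}

    def save(self, run_id, name, value):
        blob = pickle.dumps(value)
        key = hashlib.sha256(blob).hexdigest()
        self._blobs[key] = blob                      # de-duplicated across runs
        self._runs.setdefault(run_id, {})[name] = key

    def load(self, run_id, name):
        return pickle.loads(self._blobs[self._runs[run_id][name]])

store = ArtifactStore()
store.save('run_1', 'alpha', 0.5)
store.save('run_2', 'alpha', 0.7)   # a new run never overwrites an old one
```

Because storage is keyed by content hash and run id, old experiments stay reproducible even as new runs accumulate.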
Tracking & namespaces
Track everything by default
@card
@project(name='LTV')
class TrainingFlow(FlowSpec):

    @pypi(packages={'..'})
    @step
    def start(self):
        self.model = train(...)
        self.customer_id = id
Create isolated namespaces for experiments and production
All executions of flows and tasks are tracked automatically
Package and persist user code and its dependencies
All modeling libraries supported, format-agnostic
Track all state, not just results
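The points above can be sketched in plain Python: a tracker that records every execution under a namespace, so production runs and per-user experiments never collide. This is a toy illustration of the concept with hypothetical names, not Metaflow's Client API:

```python
# Toy run tracker: every execution is recorded with its namespace, so
# 'production' and per-user experiment runs are isolated from each other.
class RunTracker:
    def __init__(self):
        self._runs = []

    def record(self, namespace, flow, run_id, state):
        # Track all state, not just the final result.
        self._runs.append({'namespace': namespace, 'flow': flow,
                           'run_id': run_id, 'state': dict(state)})

    def latest(self, namespace, flow):
        runs = [r for r in self._runs
                if r['namespace'] == namespace and r['flow'] == flow]
        return runs[-1] if runs else None

tracker = RunTracker()
tracker.record('user.alice', 'TrainingFlow', '1', {'alpha': 0.5})
tracker.record('production', 'TrainingFlow', '2', {'alpha': 0.9})
```

Queries scoped to a namespace see only that namespace's runs, which is what makes experiments safe to run alongside production.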
@step
def start(self):
    self.params = list(range(100))
    self.next(self.train, foreach='params')

@resources(memory=128000)
@step
def train(self):
    self.model = train(...)
    self.next(self.join)

@step
def join(self, inputs):
    ...
Compute
Run experiment at scale in the cloud
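The `foreach` fan-out above launches one task per parameter and then joins the results. A stdlib-only sketch of those semantics using a thread pool in place of cloud tasks (the `train` function is a stand-in, not real training code):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a real training routine (assumption: any pure function works).
def train(alpha):
    return {'alpha': alpha, 'score': 1.0 - alpha * 0.01}

# Mimic foreach/join: one task per parameter, executed concurrently,
# followed by a join step that sees all 100 results.
params = list(range(100))
with ThreadPoolExecutor(max_workers=8) as pool:
    models = list(pool.map(train, params))

best = max(models, key=lambda m: m['score'])
```

In Metaflow the same pattern scales out to cloud compute because each `train` branch is an independent task with its own `@resources`.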
ML orchestrator
Security orchestrator
CI/CD orchestrator
Orchestration
So many flows!
ML workflows
Security workflows
CI/CD workflows
Orchestration
ML is not an island
Centralized common orchestrator
Defined with
CI/CD tooling
Defined with
ML-optimized
tooling
Defined with
security-optimized
tooling
Orchestration
Common infrastructure FTW!
Argo Workflows as a centralized orchestrator
CI/CD and Automation
Argo Workflows
ML and data science
Metaflow
Orchestration
Use what you already know!
Orchestration
Single-click scheduling (and back!)
Orchestration
Best of both worlds!
Orchestration
React to outside world!
@trigger(event='new_data')
class TrainingFlow(FlowSpec):

    date = Parameter('date')
    ...

@trigger_on_finish(flow='TrainingFlow')
class InferenceFlow(FlowSpec):

    @step
    def start(self):
        model = current.trigger.run.data.model
        ...
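The chaining that the decorators express declaratively — an external event runs training, and training's completion runs inference — can be sketched as a toy event bus. All names here are hypothetical; this only illustrates the event-driven idea, not Metaflow's or Argo Events' machinery:

```python
# Toy event bus: publishing 'new_data' runs the training flow; finishing
# training publishes an event that runs the inference flow.
class EventBus:
    def __init__(self):
        self._handlers = {}   # event name -> list of callables

    def on(self, event, handler):
        self._handlers.setdefault(event, []).append(handler)

    def publish(self, event, payload=None):
        for handler in self._handlers.get(event, []):
            handler(payload)

bus = EventBus()
log = []

def training_flow(payload):
    model = {'trained_on': payload}
    log.append('training')
    bus.publish('TrainingFlow.finished', model)   # downstream trigger

def inference_flow(model):
    log.append('inference')

bus.on('new_data', training_flow)
bus.on('TrainingFlow.finished', inference_flow)
bus.publish('new_data', '2024-01-01')
```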
A set of Kubernetes-native tools for deploying and running applications, managing clusters, and doing GitOps right.
Argo Project
Argo Project
200+ end user companies, 14k+ Slack members, 25k+ GitHub stars, 6k+ forks
Argo Project
The container-native workflow engine for Kubernetes
Argo Workflows
CRDs and Controllers
Interfaces
Argo Workflows
Argo Events
ML Workflow with Argo
Data ingestion
Model training
Cache store (Argo/K8s/etc.)
GitHub events (commits/PRs/tags/etc.)
The data has NOT been updated recently.
The data has already been updated recently.
Argo Events receives the events and then triggers an ML pipeline with Argo Workflows
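The cache-store check in the diagram — skip ingestion when the data was already refreshed recently — can be sketched as a small gate function. The names, TTL, and cache shape are illustrative assumptions, not the actual pipeline logic:

```python
import time

# Toy gate in front of data ingestion: skip the expensive step when the
# cache says the data was already refreshed recently (assumed TTL: 1 hour).
CACHE_TTL_SECONDS = 3600

def should_ingest(cache, key, now=None):
    """Return True only if the dataset has NOT been updated recently."""
    now = time.time() if now is None else now
    last_updated = cache.get(key)
    return last_updated is None or now - last_updated > CACHE_TTL_SECONDS

cache = {'dataset': 9_500}
recent = should_ingest(cache, 'dataset', now=10_000)   # 500 s ago -> skip
stale = should_ingest(cache, 'dataset', now=20_000)    # >1 h ago -> ingest
```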
Simplified ML Workflow with Metaflow + Argo