Sep 2019
From Zero to Airflow
Bootstrapping an ML platform
About BlueVine
About Me
What this presentation is about
Case study: Deploying an ML analytics platform into production using Apache Airflow.
Main points:
| What was already in place? |
| What did we need to achieve? |
| What did we set up? |
| What went right / wrong + solutions! |
| Tech breakdown |
| Please nothing too hard… |
The starting point – What was already in place?
Lots (and lots) of cron-jobs on a single server!
Business Goals – What did we need to achieve?
Desired | Existing |
Ability to process one client end-to-end | Scope defined by # of clients in data batch |
Decision within a few minutes | Over 15 minutes |
Map and centrally control dependencies | Hidden and distributed dependencies |
Easy and simple monitoring | Hard and confusing monitoring |
Easy to scale | Impractical to scale |
Efficient error recovery | “All or nothing” error recovery |
Possible solutions: Lower is better!
Criterion | Cronjobs | Workflow Managers | Streaming |
Achievable Runtime Latency | Minutes to Hours | Seconds to Minutes | Milliseconds to Seconds |
Effort to Implement & Transition | Low | Medium | High |
Effort to use by data teams | High | Low | Medium |
Initial Design and Plan – What did we set up?
We chose Apache Airflow
Brief intro:
DAG: Directed Acyclic Graph
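To make the DAG idea concrete, here is a minimal sketch in plain Python (3.9+), not Airflow code. The task names are hypothetical and only illustrate how declared dependencies determine a valid run order, and why a cycle is not allowed.

```python
from graphlib import TopologicalSorter

# Hypothetical task names (not from the talk).
# Each task maps to the set of tasks that must finish before it can start.
dependencies = {
    "extract_client_data": set(),
    "compute_features": {"extract_client_data"},
    "score_risk_model": {"compute_features"},
    "send_decision": {"score_risk_model"},
}


def execution_order(deps):
    """Return one valid run order.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is exactly what "acyclic" rules out.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A workflow manager does the same kind of ordering for you: tasks only declare what they depend on, and the scheduler works out when each one can run.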
Transition:
Design:
Moving to a pseudo-streaming solution
User sign up
When a new user signs up, an event carrying the user ID is sent through Redis.
Airflow processing
A listener DAG receives the user ID and activates the on-boarding DAG, which starts the risk analysis flow.
Decision is made
An event is sent back to the app, telling it what to do about the user.
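The sign-up flow described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: an in-memory `queue.Queue` stands in for the Redis channel, and `publish_signup`, `listen_once`, and `run_onboarding` are hypothetical names standing in for the app, the listener DAG, and the on-boarding DAG.

```python
import json
import queue

# Assumption: an in-memory queue stands in for the Redis channel.
signup_events = queue.Queue()


def publish_signup(user_id):
    """App side: emit a sign-up event carrying the user ID."""
    signup_events.put(json.dumps({"user_id": user_id}))


def run_onboarding(user_id):
    """Hypothetical placeholder for the on-boarding DAG's risk analysis."""
    return {"user_id": user_id, "decision": "placeholder"}


def listen_once(trigger):
    """Listener side: drain waiting events and trigger one run per user."""
    results = []
    while True:
        try:
            raw = signup_events.get_nowait()
        except queue.Empty:
            break  # nothing left to process in this pass
        event = json.loads(raw)
        results.append(trigger(event["user_id"]))
    return results


publish_signup(101)
publish_signup(102)
decisions = listen_once(run_onboarding)
```

The point of the pattern is that each user gets their own end-to-end run as soon as the event arrives, instead of waiting for the next batch.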
Real World Behavior – What went right / wrong + solutions!
The good:
The bad:
Problem: Bloated Airflow DB
Solution: Run a weekly archive of data older than 1 week.
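A weekly archive job of this kind might look like the sketch below. It uses an in-memory SQLite database so it can run anywhere; the table names `task_log` and `task_log_archive` and the `finished_at` column are hypothetical and are not Airflow's actual schema.

```python
import sqlite3
from datetime import datetime, timedelta


def archive_old_rows(conn, now, max_age=timedelta(weeks=1)):
    """Move rows older than max_age into the archive table.

    Copy and delete run in one transaction, so a failure cannot
    leave a row both copied and still present, or lost entirely.
    Returns the number of rows moved.
    """
    cutoff = (now - max_age).isoformat()
    with conn:
        conn.execute(
            "INSERT INTO task_log_archive "
            "SELECT * FROM task_log WHERE finished_at < ?",
            (cutoff,),
        )
        cursor = conn.execute(
            "DELETE FROM task_log WHERE finished_at < ?", (cutoff,)
        )
    return cursor.rowcount


# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_log (id INTEGER PRIMARY KEY, finished_at TEXT)")
conn.execute(
    "CREATE TABLE task_log_archive (id INTEGER PRIMARY KEY, finished_at TEXT)"
)
now = datetime(2019, 9, 15)
conn.executemany(
    "INSERT INTO task_log VALUES (?, ?)",
    [
        (1, (now - timedelta(days=10)).isoformat()),  # older than a week
        (2, (now - timedelta(days=2)).isoformat()),   # recent, stays put
    ],
)
moved = archive_old_rows(conn, now)
```

Keeping the live tables small is what matters here; the archived rows remain available for later analysis.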
Problem: Inefficient querying mechanism
Solution:
Overall results:
Problem: Scheduler overloaded
Solution: Scale up the scheduler instance
Problem: Scheduler can’t prioritize
Solution: Spin up a second Airflow instance just for time-sensitive processes!
Overall results:
Airflow updates are already addressing some of the issues that we found!
The system in place today – Tech Breakdown
SLA Highlights:
Data division of labor:
DS: Define logic independently
DS PR to DE:
Adding new logic to Airflow