1 of 32

1

Sep 2019

From Zero to Airflow

bootstrapping a ML platform

2 of 32

About BlueVine

  • Fintech startup based in Redwood City, CA and Tel Aviv, Israel
  • Provides working capital (loans) to small & medium-sized businesses
  • Over $2 BN funded to date

3 of 32

About Me

  • Noam Elfanbaum, Data Engineering team lead @ BlueVine
  • A team of 4 (and recruiting!)
  • Team focus:
    • ML and data infrastructure
    • Engineering best practices for non-developers
  • Ido Shlomo created the original presentation for the OSDC 2019 conference (and played a pivotal part in introducing Airflow).

4 of 32

What this presentation is about

Case study: Deploying an ML analytics platform into production using Apache Airflow.

Main points:

  • The starting point

What was already in place?

  • Business goals

What did we need to achieve?

  • Initial design and plan

What did we set up?

  • Real world behavior

What went right / wrong + solutions!

  • The system in place today

Tech breakdown

  • Questions

Please nothing too hard…

5 of 32

The starting point – What was already in place?

Lots (and lots) of cron jobs on a single server!

  • Every piece of logic ran as an independent cron job
  • Every cron job implemented its own triggering mechanism
  • Every cron job figured out its own dependencies
  • No communication between jobs

6 of 32

The starting point – What was already in place?

7 of 32

Business Goals – What did we need to achieve?

  Desired                                   Existing
  Ability to process one client end-to-end  Scope defined by # of clients in data batch
  Decision within a few minutes             Over 15 minutes
  Map and centrally control dependencies    Hidden and distributed dependencies
  Easy and simple monitoring                Hard and confusing monitoring
  Easy to scale                             Impractical to scale
  Efficient error recovery                  “All or nothing” error recovery

8 of 32

Business Goals – What did we need to achieve?

Possible solutions: lower is better!

                                    Cronjobs          Workflow managers   Streaming
  Achievable runtime latency        Minutes to hours  Seconds to minutes  Milliseconds to seconds
  Effort to implement & transition  Low               Medium              High
  Effort to use by data teams       High              Low                 Medium

9 of 32

Initial Design and Plan – What did we set up?

We chose Apache Airflow

Brief intro:

  • The core component is the scheduler / executor
  • It uses a dedicated metadata DB to track the current status of tasks
  • It uses workers to execute new ones
  • The webserver allows live interaction and monitoring

10 of 32

Initial Design and Plan – What did we set up?

DAG: Directed Acyclic Graph

  • Basically a map of tasks run in a certain dependency structure
  • Each DAG has a run frequency (e.g. every 10 seconds)
  • Both DAGs and tasks can run concurrently
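The concept can be sketched without Airflow itself: a DAG is just tasks plus dependency edges, and the execution order falls out of a topological sort (the task names below are illustrative, not from our platform):

```python
from graphlib import TopologicalSorter

# A DAG maps each task to its upstream dependencies; a task runs only
# after everything it depends on has finished.
dag = {
    "pull_data": set(),                # no upstream dependencies
    "score_model": {"pull_data"},      # runs after pull_data
    "write_decision": {"score_model"},
    "notify_app": {"score_model"},     # can run concurrently with write_decision
}

# One valid execution order; independent tasks may run concurrently.
order = list(TopologicalSorter(dag).static_order())
print(order)
```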

11 of 32

Initial Design and Plan – What did we set up?

Transition:

  • Spin up Airflow alongside the existing data DBs, servers, and cron jobs.
  • Translate every cron job into a DAG with one task that points to the same Python script (BashOperator).
  • For each cron:
    • Turn off the cron job
    • Turn on the “Singleton” DAG
  • When all crons are off → kill the old servers
  • Done!
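The mechanical part of that translation can be scripted: each crontab entry becomes a one-task DAG file whose BashOperator runs the same command. A rough sketch; the template, naming scheme, and paths are illustrative, not our actual code:

```python
# Template for a "Singleton" DAG: one BashOperator running the old cron command.
DAG_TEMPLATE = '''\
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG("{dag_id}", schedule_interval="{schedule}",
         start_date=datetime(2019, 1, 1), catchup=False) as dag:
    run = BashOperator(task_id="run", bash_command="{command}")
'''

def cron_to_dag(line):
    """Turn a crontab line into (dag_id, generated DAG file source)."""
    parts = line.split()
    schedule = " ".join(parts[:5])          # the five cron schedule fields
    command = " ".join(parts[5:])           # the command the cron job ran
    dag_id = "singleton_" + command.split("/")[-1].replace(".", "_")
    return dag_id, DAG_TEMPLATE.format(dag_id=dag_id, schedule=schedule,
                                       command=command)

dag_id, source = cron_to_dag("*/10 * * * * python /opt/jobs/score.py")
```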

13 of 32

Initial Design and Plan – What did we set up?

Design:

  • We now have approx. 200 “Singleton” DAGs
  • This does not leverage Airflow’s features at all
  • The next step is to expose hidden dependencies by designing complex DAGs and removing the “Singletons”

14 of 32

Initial Design and Plan – What did we set up?

Design:

  • Main focus of our work has been the complex “Onboarding” DAG:
    • Make funding decisions immediately after signup.
    • Ensure all requirements run in time and in the right order.
  • Rest of this talk will focus on this.
  • All other DAGs are of secondary importance.

15 of 32

Initial Design and Plan – What did we set up?

16 of 32

Initial Design and Plan – What did we set up?

Moving to a pseudo-streaming solution

  1. User signs up: when a new user signs up, an event with the user ID is sent through Redis.
  2. Airflow processing: a listener DAG picks up the user ID and triggers the onboarding DAG, which starts the risk analysis flow.
  3. Decision is made: an event is sent back to the app telling it what to do with the user.
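A minimal sketch of the listener step. A plain deque stands in for Redis and a list records triggered runs; in production this would be a Redis pop plus an Airflow trigger call, and every name here is illustrative:

```python
from collections import deque

# Stand-in for the Redis queue the app pushes signup events onto.
signup_queue = deque(["user-101", "user-102"])
triggered_runs = []  # stand-in for DAG runs created in Airflow

def trigger_onboarding_dag(user_id):
    # In production: trigger the onboarding DAG, passing the user ID
    # as the run's conf so one client is processed end to end.
    triggered_runs.append({"dag_id": "onboarding",
                           "conf": {"user_id": user_id}})

def listener_pass():
    # One pass of the listener DAG: drain pending signup events and
    # start one onboarding run per user.
    while signup_queue:
        trigger_onboarding_dag(signup_queue.popleft())

listener_pass()
```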

17 of 32

Initial Design and Plan – What did we set up?

18 of 32

Real World Behavior – What went right / wrong + solutions!

The good:

  • The transition went smoothly
  • The UI works well (except for some quirks)
  • The system is mostly stable
  • Immediate gains in overall time-to-decision

The bad:

  • No user-specific access roles
  • The scheduler can silently die
  • Tasks can become “zombies”
  • Scheduler performance quickly becomes a major (!) bottleneck

19 of 32

Real World Behavior – What went right / wrong + solutions!

Problem: Bloated Airflow DB

  • Big DB → slower queries → slower scheduling & execution
  • The DB contains metadata for all DAG / task runs
  • High DAG frequency + many DAGs + many tasks == many rows
  • Under our setup, within the first two months, the DB:
    • Had served over 4 BN calls
    • Was over 15 GB in size

Solution: Run a weekly archive of data older than 1 week.
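The archive job boils down to an insert-select plus a delete against the metadata DB. A toy sketch using SQLite (our job ran against the real Airflow schema; the table and column names here are simplified):

```python
import sqlite3

# In-memory stand-in for the Airflow metadata DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (task_id TEXT, execution_date TEXT)")
conn.execute("CREATE TABLE task_instance_archive (task_id TEXT, execution_date TEXT)")
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?)",
    [("score", "2019-06-01"), ("score", "2019-09-01")],
)

CUTOFF = "2019-08-25"  # in the real job: now() minus 7 days

# Copy rows older than the cutoff into the archive table, then delete
# them, keeping the hot table (and the scheduler's queries) small.
conn.execute(
    "INSERT INTO task_instance_archive "
    "SELECT * FROM task_instance WHERE execution_date < ?", (CUTOFF,))
conn.execute("DELETE FROM task_instance WHERE execution_date < ?", (CUTOFF,))
conn.commit()

live = conn.execute("SELECT COUNT(*) FROM task_instance").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM task_instance_archive").fetchone()[0]
```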

20 of 32

Real World Behavior – What went right / wrong + solutions!

Problem: Inefficient querying mechanism

  • The OB DAG has 40 tasks with 20 parallel runs, so the scheduler issued ~800 (!) queries on every pass for this one DAG alone.
  • Then there are all the other ~200 DAGs...

Solution:

  • Instead of a query per task per DAG run, make one query per DAG run.
  • This is our humble contribution to the Airflow source code: https://github.com/apache/airflow/pull/4751.

21 of 32

Real World Behavior – What went right / wrong + solutions!

Overall results:

  • Average scheduling delay between tasks decreased by ~50%: from 1.5 sec to 0.8 sec.
  • Max delay between tasks decreased by ~80%: from 6.4 sec to 1.3 sec.

22 of 32

Real World Behavior – What went right / wrong + solutions!

Problem: Scheduler overloaded

  • The scheduler has to continually re-parse all DAG files
  • Many DAGs + many tasks == a lot to parse

Solution: Use a stronger scheduler instance

  • Airflow supports parallel DAG parsing
  • Stronger instance → more parsing processes → faster scheduling
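In the Airflow 1.10 line, parsing parallelism is controlled from the scheduler section of airflow.cfg; a sketch (the value shown is illustrative and should match the instance's core count):

```ini
[scheduler]
# Number of scheduler processes used to parse and schedule DAGs in
# parallel; a stronger instance lets this be raised safely.
max_threads = 8
```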

23 of 32

Real World Behavior – What went right / wrong + solutions!

Problem: Scheduler can’t prioritize

  • The scheduler has to continually parse all DAGs
  • Not all DAGs are equally important
  • But all are given the same scheduling resources

Solution: Spin up a 2nd Airflow instance just for time-sensitive processes!

  • Servers are cheap, time is expensive
  • Dedicated instance → fewer DAGs / tasks → faster scheduling

24 of 32

Real World Behavior – What went right / wrong + solutions!

Overall results:

  • The OB DAG can never be “starved” of resources by competition from other DAGs.
  • Approx. 30% reduction in average end-to-end OB flow runtime.
  • Approx. 60% reduction in average time spent on transitions between tasks.

25 of 32

Real World Behavior – What went right / wrong + solutions!

Airflow updates are already addressing some of the issues that we found!

26 of 32

The system in place today – Tech Breakdown

27 of 32

The system in place today – Tech Breakdown

SLA Highlights:

  • Overall runtime is under 3 minutes for 95% of the cases

  • Any given task runs under 1 minute for 95% of the cases

  • Time between dependent tasks is under 3 seconds

28 of 32

The system in place today – Tech Breakdown

Data division of labor:

  • DS owns models & analytics
  • DS owns workflow logic via PR to DE
  • DE owns workflow implementation via PR by DS
  • DE owns Airflow settings and architecture
  • DO owns Airflow implementation via PR by DE
  • DO owns source-of-truth operational DBs and repos

29 of 32

The system in place today – Tech Breakdown

DS: Define logic independently

30 of 32

The system in place today – Tech Breakdown

DS PR to DE:

Adding new logic to Airflow

31 of 32

The system in place today – Tech Breakdown

DS PR to DE:

Adding new logic to Airflow

32 of 32

Questions? + Thanks

(+anyone looking for a job?)

@noamelf