1 of 154

2 of 154

📣 Join the Airflow Taiwan community Slack

① Join the official Airflow Slack workspace: https://apache-airflow-slack.herokuapp.com/

② Search for the #users-taiwan channel

3 of 154

4 of 154

Have you ever run into any of the following?

  • Different teams produce the same report, but with different metric definitions
  • Different teams independently build nearly identical tables or datasets
  • Discussing the same metric with a colleague feels like talking past each other (your DAU isn't my DAU?)

5 of 154

Format

Run as an open discussion: everyone gets 10 minutes to speak, in whatever form works, e.g. slides, talking it through, etc.

  • (2 min) Background: self-introduction, your team, the team's stakeholders, etc.
  • (3 min) The problem you ran into
  • (5 min) How you solved it / plan to solve it

6 of 154

7 of 154

This is the invisible slide deck

  1. This deck is reused at every meetup
  2. New slides are added at the front for each meetup
  3. Everyone is welcome to insert one or two slides of material or questions they'd like to share

8 of 154

2023-12-07 Agenda

- Should the format be adjusted, e.g. adding a reading group / workshop? - Tai-Wei Huang

- How to invite more speakers - Tai-Wei Huang

- Apply for official sponsorship? (stickers) - Damon

- More organized division of labor? (events, promotion, ...) - Damon

9 of 154

10 of 154

Taiwan Airflow Meetup

2023.11.09 @ Dcard Headquarters

11 of 154

Data validation and testing by pandera

wg
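A minimal sketch of the kind of check pandera enables; the column names and rules here are hypothetical illustrations, not taken from the talk:

import pandas as pd
import pandera as pa

# Declare the expected shape of a DataFrame once, then validate every batch
schema = pa.DataFrameSchema({
    "user_id": pa.Column(int, nullable=False),
    "amount": pa.Column(float, checks=pa.Check.ge(0)),
})

df = pd.DataFrame({"user_id": [1, 2], "amount": [10.0, 5.5]})
schema.validate(df)  # raises SchemaError if any rule is violated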

12 of 154

Taiwan Airflow Meetup

2023.10.05 @ Dcard Headquarters

13 of 154

Taiwan Airflow Meetup

2023.09.07 @ Dcard Headquarters

14 of 154

DataHub: A Brief Look at the Features and Outlook of Adopting a Metadata Platform

蔡睿峰 Michael JF Tsai

15 of 154

About Me

  • Data Engineer @ E.Sun Bank
    • Data platform team in the Intelligent Division
    • Build and maintain data pipelines
    • Provide data platforms to data users
      • Airflow
      • DataHub
    • Evaluate new data technologies
  • Past
    • National Chengchi University

16 of 154

Outline

  • Data flow
  • Why do we need a metadata platform?
    • Problems for DEs
    • Problems for DAs/DSs
  • How DataHub helps us
    • DataHub features
    • How it solves our problems
  • Future work

17 of 154

Data Flow

  • One data engineer team

18 of 154

Data Flow

  • Many data analyst/scientist teams


19 of 154

Stakeholders

  • Data engineer team
  • Data analyst/scientist teams

20 of 154

Problems for DE

  • Data sources change their schemas or have incidents, so downstream data cannot be updated

  • Data analyst/scientist teams submit their data pipeline requests to the data engineer team through forms

  • One dataset can be used by many data analyst/scientist teams

21 of 154

Problems for DA/DS

  • They don't know which teams have requested a given data source

  • Poor data exploration experience

  • It's hard for a newcomer to understand how the data in a DA/DS team relates

22 of 154

Problems in the Financial Industry

  • We are under financial regulatory supervision

  • Very careful with personal information

  • Permission management

23 of 154

DataHub


https://datahubproject.io/docs/architecture/architecture

24 of 154

DataHub POC Setup

  • DataHub version 0.9.5

  • Set up with docker-compose

  • Deployed only in the analysis environment

  • 6000+ datasets
    • DE: 3900+
    • DA/DS: 2200+

25 of 154

DataHub Features

  • Search and discovery

  • View dataset metadata

  • Data lineage

  • Authorization

  • Data management


26 of 154

DataHub Features

  • Search and discovery


27 of 154

DataHub Features

  • View dataset metadata


28 of 154

DataHub Features

  • Data lineage


29 of 154

DataHub Features

  • Authorization
    • Role and user group
      • Admin
      • Editor
      • Reader
    • Platform or Metadata privileges
      • Platform level: managing users & groups, managing policies
      • Metadata: edit dataset documentation, links, owners, tags


30 of 154

DataHub Features

  • Data management
    • Glossary terms
      • Formal, shared vocabulary within an organization
    • Tags
      • Informal, loosely controlled labels that help with search & discovery
    • Owners
      • Technical owner
      • Business owner
      • Data steward

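As a rough sketch of how such metadata can be managed programmatically, assuming the acryl-datahub Python emitter (the dataset name and tag below are hypothetical):

from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import GlobalTagsClass, TagAssociationClass

# Attach a "pii" tag to a (hypothetical) dataset via the DataHub REST endpoint
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
dataset_urn = make_dataset_urn(platform="hive", name="warehouse.users", env="PROD")
tags = GlobalTagsClass(tags=[TagAssociationClass(tag=make_tag_urn("pii"))])
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=tags))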

31 of 154

Problems for DE

  • Data sources change their schemas or have incidents, so downstream data cannot be updated

  • Data analyst/scientist teams submit their data pipeline requests to the data engineer team through forms

  • One dataset can be used by many data analyst/scientist teams

Addressed by the data lineage and data owner features

32 of 154

Problems for DA/DS

  • They don't know which teams have requested a given data source

  • Poor data exploration experience

  • It's hard for a newcomer to understand how the data in a DA/DS team relates

Addressed by the data owner, data discovery, dataset metadata view, and data lineage features

33 of 154

Problems in the Financial Industry

  • We are under financial regulatory supervision

  • Very careful with personal information

  • Permission management

Addressed by the data management, data owner, and authorization features

34 of 154

Future Work

  • Evaluate lineage tools (OpenLineage) to set up data lineage

  • Integrate with data quality tools (Great Expectations)

  • Metrics, Metrics, Metrics

35 of 154

Thanks for listening

Q&A


Intellectual Property Rights

The rights and the intellectual property rights (including but not limited to copyrights, patents, trademarks, etc.) of the Material belong to E.SUN Financial Holding Co., Ltd. and its subsidiaries (hereinafter referred to as “E.SUN”). Any copy, reproduction, modification, upload, post, distribution, transmission, sale or illegal usage of the Material in any way shall be strictly prohibited without the prior written permission of E.SUN. Except as expressly provided herein, E.SUN does not, in providing this Material, grant any express or implied right to you under any patents, copyrights, trademarks, trade secrets or any other intellectual property rights.

36 of 154

Datahub - Dcard

Damon

37 of 154

Comparison of Amundsen and Datahub

38 of 154

Admin

Tool                   | Datahub            | Amundsen
---------------------- | ------------------ | --------------
Monitoring             | Yes                | No
Authorization          | Metadata policies  | In the roadmap
Ingestion              | Plugin based       | ETL based
Components Management  | Medium             | Easy
Metadata Management    | Medium             | Easy
Metadata Architectures | Stream + API: Push | Crawl based
Backend                | Java               | Python

39 of 154

User

Tool      | Datahub                                                              | Amundsen
--------- | -------------------------------------------------------------------- | -------------------------------------------------
Discovery | Overview, Domains, Glossary Terms, Data Product                      | Tags
Lineage   | Table, Column                                                        | Table, Column
Metadata  | Schema and column versions, historical stats, additional properties | Current stats, table and dashboard relationships

40 of 154

Features for workflows

41 of 154

GitHub Actions

This use case shows how to warn about changes to a target and surface their effects.

  • Demo (validate DBT column changes)
  • Source Code

42 of 154

Glossary terms sync action

This use case shows how to monitor for the updates required by changes to a target.

  • Demo
  • Source Code

43 of 154

DataHub Actions framework

44 of 154

Monitoring data pipelines

45 of 154

Pipeline Monitoring - Dcard

Damon

46 of 154

Monitoring metrics

  • Status of the Airflow cluster
  • Status of raw data
  • Data details of failed tasks

47 of 154

Airflow monitoring

48 of 154

Architecture for Airflow monitoring

  • Airflow statsd metrics
  • Grafana
  • Prometheus

49 of 154

Send a P0 notification when the Airflow cluster has…

  • Queued tasks per pool > X for Y minutes
  • Worker restarts > X in Y minutes
  • Scheduler queueing delay > X minutes

50 of 154

Send a P1 notification when the Airflow cluster has…

  • Failures per core operator > X percent
  • DAG import errors > 0 in any service

51 of 154

Raw data monitoring

52 of 154

Architecture for raw data monitoring

  • Great Expectations
  • An endpoint for accessing the HTML reports
  • Test cases
  • Talking to the stakeholders

53 of 154

GE UI

Runs overview

54 of 154

GE UI

The run overview

55 of 154

GE UI

The suite information

56 of 154

Testing

  • expect_column_values_to_be_in_set (auto or manually)
  • expect_column_values_to_not_be_null
  • expect_column_values_to_be_unique
  • expect_table_row_count_to_be_between
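A minimal sketch of these expectations using Great Expectations' legacy Pandas API (the file and column names are hypothetical):

import great_expectations as ge
import pandas as pd

df = ge.from_pandas(pd.read_csv("events.csv"))  # hypothetical input file

df.expect_column_values_to_be_in_set("status", ["active", "deleted"])
df.expect_column_values_to_not_be_null("user_id")
df.expect_column_values_to_be_unique("event_id")
df.expect_table_row_count_to_be_between(min_value=1, max_value=1_000_000)

result = df.validate()  # aggregates every expectation attached above
print(result)           # includes an overall "success" flag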

57 of 154

Talk to the stakeholders

  • Data owners
  • The workflow for when a test fails
  • Making it easy to contribute

58 of 154

Failed task monitoring

59 of 154

Data details of failed tasks

  • A crawler for failed BigQuery jobs
  • BigQueryValueCheckOperator
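A minimal sketch of a BigQueryValueCheckOperator guard; the table name and thresholds are hypothetical:

from airflow.providers.google.cloud.operators.bigquery import BigQueryValueCheckOperator

# Fail the task (and alert) if the daily row count drifts from the expected value
check_rows = BigQueryValueCheckOperator(
    task_id="check_daily_row_count",
    sql="SELECT COUNT(*) FROM `my-project.events.daily` WHERE ds = '{{ ds }}'",
    pass_value=100_000,   # expected count
    tolerance=0.1,        # accept +/-10% around pass_value
    use_legacy_sql=False,
)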

60 of 154

Airflow Meetup in October

https://www.meetup.com/taipei-py/events/295948520/?isFirstPublish=true

🏠 Venue: TBD
📆 Time: 2023/10/5 (Thu), doors at 19:15; 19:30 - 21:30
📌 Topics:

  1. After Data Pipeline - Jia-Xuan Liu & 冠博 from Canner Data
  2. Data cleaning usually follows an ETL or ELT architecture, but have you thought about what comes after the end of the data flow? To satisfy all kinds of data needs, data engineers keep building one-off and intermediate tables, and over time the database becomes messy and hard to manage. The semantic layer was born to solve exactly this.
  3. Speakers always wanted!
  4. Everyone is welcome to share their views; feel free to add 1-2 pages to the slides beforehand to facilitate discussion.

61 of 154

Taiwan Airflow Meetup

2023.07.06 @ Dcard Headquarters

62 of 154

63 of 154

64 of 154

65 of 154

66 of 154

67 of 154

68 of 154

69 of 154

70 of 154

71 of 154

72 of 154

73 of 154

Taiwan Airflow Meetup

2023.06.01 @ Dcard Headquarters

74 of 154

📣 The COSCUP and PyCon community track CFPs are open!

75 of 154

Taiwan Airflow Meetup

2023.05.11 @ Dcard Headquarters

76 of 154

77 of 154

Brute-Force Data Processing

Alex Hsieh (DouEnergy)

Airflow Taiwan User Meetup #4, 2023 May

78 of 154

79 of 154

What is DuckDB?

  • In-Process OLAP DBMS

  • “The SQLite for Analytics”

  • Free and Open Source (MIT)

80 of 154

81 of 154

"If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating"

Hadley Wickham (Chief Scientist at RStudio)

82 of 154

83 of 154

Bro! I don't care 🚗 about any new database 😵‍💫

  • 300+ databases on the DB-Engines Ranking

  • If your db isn't that lit, don't bother inviting me

84 of 154

Source: Peter Boncz

85 of 154

Why DuckDB?

  • Easy

  • Fast

  • Open

86 of 154

Easy

  • `pip install duckdb` is all you need

  • CSV, Parquet, JSON

  • DataFrames

  • HTTP, S3, MinIO, …

  • Friendlier SQL

  • Talk is cheap, show me the code (Google Colab; a small sketch follows below)
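A minimal sketch of that ease of use (the Parquet file name is hypothetical):

import duckdb
import pandas as pd

# Query a Parquet file in place: no schema declaration, no load step
duckdb.sql("SELECT status, COUNT(*) AS n FROM 'events.parquet' GROUP BY status").show()

# Query a pandas DataFrame directly by its variable name
df = pd.DataFrame({"x": [1, 2, 3]})
print(duckdb.sql("SELECT SUM(x) FROM df").fetchall())  # [(6,)]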

87 of 154

Easy

88 of 154

Fast — benchmark by Ritchie Vink (Polars)

89 of 154

Fast — benchmark by DuckDB Labs

90 of 154

Open

  • Code

  • Mind

  • Community

91 of 154

92 of 154

How to test DAGs?

Tai-Wei Huang
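One common starting point (a sketch, not necessarily what this talk covered): load every DAG file with DagBag and assert it imports cleanly.

import pytest
from airflow.models import DagBag

@pytest.fixture(scope="session")
def dag_bag():
    return DagBag(include_examples=False)  # parse the project's DAG folder

def test_no_import_errors(dag_bag):
    assert dag_bag.import_errors == {}

def test_every_dag_has_an_owner(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        assert dag.default_args.get("owner"), f"{dag_id} has no owner"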

93 of 154

94 of 154

Taiwan Airflow Meetup

2023.04.13 @ Dcard Headquarters

95 of 154

96 of 154

97 of 154

98 of 154

Survey Workflow Management Framework

Tai-Wei Huang

99 of 154

Candidates

Note: the comparison below is based on each tool's latest version at the time.

  • Jenkins: released in 2011, written in Java; the veteran open-source automation tool, today used mainly for CI/CD. Current stable version: 2.387.2 LTS

  • Airflow: released in 2015, a Python-based scheduler, used mainly for data processing, ETL, ML training, etc. Current version: v2.5.3

  • Prefect: released in 2018, a Python-based scheduler, used mainly for data processing, ETL, ML training, etc. Current version: v2.10.2

  • Argo Workflows: released in 2018, a container-native workflow engine for orchestrating jobs on Kubernetes; a general-purpose workflow tool. Current version: v3.4.6

100 of 154

Workflow Definition Language

  • Jenkins: Groovy
  • Airflow: Python
  • Prefect: Python
  • Argo Workflows: YAML (though you can actually use Go to dynamically define Workflow structs)

101 of 154

Calendar

  • Jenkins: cron only

  • Airflow: cron expressions, time deltas, and timetables; supports many kinds of time triggers and custom calendars (a sketch follows below)

  • Prefect: schedules — cron, interval, rrule; supports many kinds of time triggers, but not custom calendars

  • Argo Workflows: Argo Events calendar — cron and time intervals; custom calendars are awkward, requiring hardcoded exclusionDates to exclude dates
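A minimal sketch of Airflow's schedule options (the DAG ids are hypothetical; custom Timetable classes are also possible but must be registered via a plugin):

from datetime import datetime, timedelta
from airflow import DAG

# Cron expression: run at 06:00 every day
with DAG("daily_report", start_date=datetime(2023, 1, 1), schedule_interval="0 6 * * *"):
    ...

# Time delta: run every 4 hours
with DAG("rolling_sync", start_date=datetime(2023, 1, 1), schedule_interval=timedelta(hours=4)):
    ...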

102 of 154

Calendar

103 of 154

Calendar

104 of 154

Backfill

  • Jenkins: not supported
  • Airflow: built-in backfill via the start_date and schedule_interval parameters (a sketch follows below)
  • Prefect: not supported, but you can trigger runs via the API and use context or Parameters to pass the date
  • Argo Workflows: cron-backfill; fairly primitive, and you have to implement much of it yourself
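A minimal sketch of Airflow's built-in backfill behavior (the DAG id is hypothetical):

from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    "backfill_demo",
    start_date=datetime(2023, 1, 1),  # the window to fill starts here
    schedule_interval="@daily",
    catchup=True,                     # scheduler creates a run for every missed day
) as dag:
    EmptyOperator(task_id="noop")

# An explicit window can also be requested from the CLI:
#   airflow dags backfill -s 2023-01-01 -e 2023-01-31 backfill_demo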

105 of 154

Calendar

106 of 154

Event trigger

  • Generic Webhook Trigger
  • Prefect REST API
  • Argo Event,支援 20+ 種不同來源

107 of 154

Context in Workflow

  • Jenkins: not supported
  • Airflow: rich time parameters and workflow metadata; Jinja templates are also supported (a sketch follows below)
  • Prefect: prefect.context provides rich time parameters and workflow metadata
  • Argo Workflows: not supported; you can only roll your own via parameters
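A minimal sketch of reading Airflow's runtime context inside a task:

from airflow.decorators import task
from airflow.operators.python import get_current_context

@task
def report():
    ctx = get_current_context()
    print(ctx["ds"])                   # logical date, e.g. "2023-04-13"
    print(ctx["data_interval_start"])  # start of the data interval
    print(ctx["dag_run"].run_id)

# The same values are available to operators via Jinja, e.g.:
#   BashOperator(task_id="t", bash_command="echo {{ ds }}")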

108 of 154

Context in Workflow

109 of 154

Context in Workflow

110 of 154

Dynamic Workflow

  • Jenkins: not supported
  • Airflow: Dynamic Task Mapping (a sketch follows below)
  • Prefect: Mapping
  • Argo Workflows: Loops
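A minimal sketch of Airflow's Dynamic Task Mapping (2.3+); the file names are hypothetical:

from datetime import datetime
from airflow.decorators import dag, task

@dag(start_date=datetime(2023, 1, 1), schedule_interval=None)
def mapping_demo():
    @task
    def list_files():
        return ["a.csv", "b.csv", "c.csv"]  # discovered at runtime

    @task
    def process(path: str):
        print(f"processing {path}")

    # One task instance is expanded per element of the upstream result
    process.expand(path=list_files())

mapping_demo()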

111 of 154

Operators

  • Jenkins: plugins; no integrations for data stacks

  • Airflow: built-in library of operators; integrates with many 3rd-party data stacks

  • Prefect: built-in library of operator tasks; integrates with many 3rd-party data stacks

  • Argo Workflows: Argo Workflow plugins; no 3rd-party data stack integrations

112 of 154

Operators

113 of 154

Operators

114 of 154

Operators

115 of 154

Data Transfer Between Jobs

  • Jenkins: only booleans, or custom-built by the developer
  • Airflow: XComs (serializable), stored in the metadata DB (a sketch follows below)
  • Prefect: task results (serializable), stored on the local file system or in object storage
  • Argo Workflows: output parameters & artifacts, stored in object storage
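A minimal sketch of XCom-based transfer using the TaskFlow API, where return values are pushed to XCom automatically:

from datetime import datetime
from airflow.decorators import dag, task

@dag(start_date=datetime(2023, 1, 1), schedule_interval=None)
def xcom_demo():
    @task
    def extract():
        return {"rows": 42}   # pushed to XCom; must be serializable

    @task
    def load(stats: dict):
        print(stats["rows"])  # pulled from XCom via the dependency

    load(extract())

xcom_demo()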

116 of 154

Data Transfer Between Jobs

117 of 154

Data Transfer Between Jobs

118 of 154

Job Queue Priority

  • Jenkins: Priority Sorter plugin
  • Airflow: pools and priority_weights (a sketch follows below)
  • Prefect: work pools and work queues; the scheduling order cannot be controlled directly
  • Argo Workflows: priority and priorityClassName; relies on Kubernetes' own resource priority and preemption mechanisms
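A minimal sketch of pools plus priority weights in Airflow; the "etl" pool is assumed to have been created beforehand (via the UI or CLI):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("priority_demo", start_date=datetime(2023, 1, 1), schedule_interval=None):
    # Both tasks compete for the same pool's slots; the higher weight is scheduled first
    BashOperator(task_id="urgent", bash_command="echo urgent",
                 pool="etl", priority_weight=10)
    BashOperator(task_id="routine", bash_command="echo routine",
                 pool="etl", priority_weight=1)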

119 of 154

Job Queue Priority

120 of 154

Job Queue Priority

121 of 154

Installation

  • Jenkins: agent based; Helm chart available
  • Airflow: agent-less; Helm chart available
  • Prefect: agent based; Helm chart available
  • Argo Workflows: installed as K8s CRDs; Helm chart available

122 of 154

Kubernetes

  • Jenkins: plugin; creates K8s agents
  • Airflow: K8s executor and K8s operator
  • Prefect: prefect_kubernetes
  • Argo Workflows: runs as K8s CRDs

123 of 154

Maintenance - Scale

  • Jenkins: Medium
  • Airflow: Medium, depends on the executor
  • Prefect: Medium, depends on the executor
  • Argo Workflows: Easy; container-based

124 of 154

Maintenance - Operation

  • Jenkins: Medium; no backfill support
  • Airflow: Easy; built-in backfill
  • Prefect: Easy; beautiful UI
  • Argo Workflows: Medium; backfill is inconvenient

125 of 154

Maintenance - Debug

  • Jenkins: Easy
  • Airflow: Hard
  • Prefect: Hard, due to the complex architecture
  • Argo Workflows: Easy

126 of 154

Maintenance - Upgrade

  • Jenkins: Easy
  • Airflow: hard to deploy and upgrade, because DAG files have many dependencies on Airflow itself, packages, and the Python version…
  • Prefect: hard to deploy and upgrade; the DAG files have many dependencies on Prefect, packages, and the Python version…
  • Argo Workflows: Easy, because in most use cases the business logic and the DAGs are separated

127 of 154

Monitoring

  • Jenkins: Prometheus plugin for Jenkins
  • Not supported
  • Argo Workflows: Prometheus metrics, definable directly in Workflow templates

128 of 154

Airflow

Pros

  • Data-intensive jobs
  • Python users
  • Pythonic DAG code
  • Rich operators and 3rd-party integrations, especially data- and ML-related
  • Beautiful UI

Cons

  • Upgrades are a nightmare
  • Hard to troubleshoot
  • Some docs are ambiguous

129 of 154

Alerting

  • Jenkins: Email, Slack
  • Airflow: Email and callbacks; behavior is user-defined, e.g. sending failure messages to Slack, Opsgenie, etc. (a sketch follows below)
  • Prefect: Notifications, supporting many channels
  • Argo Workflows: Workflow Notifications; you must define the exit handler and its behavior yourself
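A minimal sketch of an Airflow failure callback; notify_slack is a hypothetical helper (a real setup might use the Slack provider instead of the print):

from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

def notify_slack(context):
    # Hypothetical: replace the print with a Slack/Opsgenie API call
    ti = context["task_instance"]
    print(f"Task {ti.task_id} failed for {context['ds']}")

with DAG(
    "alerting_demo",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args={"on_failure_callback": notify_slack},
):
    EmptyOperator(task_id="noop")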

130 of 154

Jenkins

Pros

  • Lightweight
  • Simple jobs, quick to start
  • Java users
  • Reliable

Cons

  • Great for CI/CD use, but not suited for data pipelines

131 of 154

Prefect

Pros

  • Data-intensive jobs
  • Python users
  • Rich trigger methods
  • Rich operators and 3rd-party integrations, especially data- and ML-related

Cons

  • Upgrades are a nightmare
  • Not easy to debug

132 of 154

Argo Workflows

Pros

  • Easy to integrate with K8s
  • DAG and business-logic code are loosely coupled (if you don't use Go to dynamically define Workflow structs)
  • Can leverage Argo Events
  • Defining monitoring metrics in YAML is very powerful
  • Supports running daemon containers, so services can be reused within a workflow run

Cons

  • Fewer scheduling methods than other workflow management tools
  • No data-processing operators (though maybe it will support them someday)
  • As a general-purpose workflow management tool, Argo Workflows does not integrate with 3rd-party data stacks (data quality tools, data lineage tools, data governance tools, …)

133 of 154

Dependency Management

  • Local and production should share the same requirements
  • Allow extras on top for local and production

## pip-tools

https://github.com/jazzband/pip-tools#workflow-for-layered-requirements

134 of 154

production example

# requirements.in
airflow==2.0.2

$ pip-compile  # produces a requirements.txt

# requirements.txt
airflow==2.0.2
    # via -r requirements.in
pandas==1.5
    # via airflow

135 of 154

local example

# dev-requirements.in
-c requirements.txt
pytest

$ pip-compile dev-requirements.in  # produces a dev-requirements.txt

# dev-requirements.txt
pytest==7.1.2

136 of 154

install
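A sketch of the install step, assuming the layered pip-tools workflow linked above:

$ pip-sync requirements.txt dev-requirements.txt  # installs exactly the pinned set, removing anything else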

137 of 154

2023/03/02

138 of 154

Lessons from an Airflow Upgrade

Damon

139 of 154

Context

  • From 2.1.4 to 2.3.4
  • Database: 300 GB
  • CeleryKubernetesExecutor

140 of 154

Workflow

Upgrade Plan → Testing Cluster → Staging Cluster → Prod Cluster

Each cluster goes through the same steps: DAG updates → DB migration → DAG testing → server testing

Introduce the workflow

Introduce the new features

141 of 154

Notes after the upgrade

DB migration took a very long time

Symptom: the duration of the production DB migration was underestimated, which blocked all pipelines for 5 hours.

Fix: consider a zero-downtime DB migration.

DAG runtime errors

Symptom: DAGs hit runtime errors after the upgrade because a data type changed with the new version of the pandas library.

Fix: run the tasks in their own Python virtual environments (a sketch follows below).
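A minimal sketch of isolating a task's dependencies with PythonVirtualenvOperator; the DAG id and pandas pin are hypothetical:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator

def transform():
    # Imports must live inside the callable: it runs in the venv, not the worker env
    import pandas as pd
    print(pd.__version__)

with DAG("venv_demo", start_date=datetime(2023, 1, 1), schedule_interval=None):
    PythonVirtualenvOperator(
        task_id="transform_in_venv",
        python_callable=transform,
        requirements=["pandas==1.3.5"],  # pinned independently of the worker's pandas
        system_site_packages=False,
    )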

142 of 154

Testing Cluster

  • 1% of DAGs tested
  • 60% of operators tested
  • The whole process went smoothly

143 of 154

Prod Cluster

  • The DB migration took 5 hours
    • Why so different from before: 15x more data (20 GB vs. 300 GB)
  • Untested DAGs blew up in all kinds of ways
    • Causes:
      • Changed behavior of Airflow environment variables
      • Changed package versions

144 of 154

Staging Cluster

  • 50% of DAGs tested
  • 90% of operators tested
  • Hit a bug:
    • Related to the CeleryKubernetesExecutor
    • Why the testing cluster didn't catch it: it never ran 2+ dynamic DAGs using K8s execution at the same time
  • Mitigation: downgrade to 2.3.3
    • Because the stable version shown on the Airflow Docker Hub was 2.3.3

145 of 154

Improvement ideas

  • Zero-downtime migration
    • Metadata volume
    • Switching between multiple databases
    • A workflow to prevent duplicate data
  • Data for testing
  • Unit testing
    • A unit testing framework
  • Package encapsulation

146 of 154

2023/02/02

147 of 154

2023-02-02

- Discuss the future format

- A long-term organizer for the event

- Venue sponsorship, or a cheap venue?

- Share everyone's hands-on experience

148 of 154

Sharing

Everyone is welcome to give a quick share using one or two slides

149 of 154

Venue candidates?

Institute for Information Industry, Digital Transformation Research Institute

3F, No. 51, Sec. 2, Chongqing S. Rd. (next to MRT Chiang Kai-shek Memorial Hall Station)

Their office has a meeting room that seats about 20-30 people

150 of 154

Venue candidates?

Offered by E.SUN

  • Near Songshan Airport, 20-40 people
  • Inside NTU

A headhunting company, nichebridge

  • Near Jilin Road (Mon/Fri are convenient); no tables

根聚地

151 of 154

Venue candidates?

Offered by Dcard

  • Sun Yat-sen Memorial Hall MRT station, near CTS
  • Meeting rooms or open space for 6-60 people
  • Free drinks and snacks
  • WiFi
  • 4/13 (Thu)

152 of 154

How to test Redshift

Redshift is AWS's data warehouse solution.

But it can't run locally, so how do you unit test against it?

Here's a package that lets you test it:

https://pypi.org/project/pytest-mock-resources/
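A rough sketch based on the pytest-mock-resources docs; the fixture stands in for Redshift locally, and exact fixture semantics may vary by version:

from pytest_mock_resources import Statements, create_redshift_fixture

# Pre-create the schema before each test; the fixture emulates Redshift locally
redshift = create_redshift_fixture(
    Statements("CREATE TABLE IF NOT EXISTS users (id INT, name VARCHAR)"),
)

def test_count_users(redshift):
    # The fixture behaves like a SQLAlchemy engine (1.x style shown)
    redshift.execute("INSERT INTO users VALUES (1, 'amy')")
    assert redshift.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 1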

153 of 154

pytest with testcontainers
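A minimal sketch of the testcontainers approach, assuming the testcontainers Python package and a local Docker daemon (Postgres shown as the stand-in database):

import sqlalchemy
from testcontainers.postgres import PostgresContainer

def test_against_a_real_database():
    # Spins up a throwaway Postgres container for the duration of the test
    with PostgresContainer("postgres:13") as pg:
        engine = sqlalchemy.create_engine(pg.get_connection_url())
        with engine.connect() as conn:
            assert conn.execute(sqlalchemy.text("SELECT 1")).scalar() == 1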

154 of 154