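📣 Join the Airflow Taiwan community Slack
① Join the official Airflow Slack workspace: https://apache-airflow-slack.herokuapp.com/
② Search for the #users-taiwan channel
Have you ever run into the following situations?
How it works
The meetup is run as a discussion. Everyone gets 10 minutes to speak and may present in any way, for example with slides or by just talking.
This is a hidden slide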
2023-12-07 Agenda
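- Should the format change? For example, add a study group or workshop? - Tai-Wei Huang
- How to invite more speakers - Tai-Wei Huang
- Seek official sponsorship? (stickers) - Damon
- A more organized division of work? (events, promotion …) - Damon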
Taiwan Airflow Meetup
2023.11.09 @ Dcard Headquarters
Data validation and testing by pandera
wg
Taiwan Airflow Meetup
2023.10.05 @ Dcard Headquarters
Taiwan Airflow Meetup
2023.09.07 @ Dcard Headquarters
DataHub: A Brief Look at the Features and Outlook of Adopting a Metadata Platform
蔡睿峰 Michael JF Tsai
About Me
Outline
Data Flow
Stakeholders
Data engineer team
Data analyst/scientist teams
Problems to DE
Problems to DA/DS
Problems in Financial industry
DataHub
https://datahubproject.io/docs/architecture/architecture
DataHub POC Setup
DataHub Features
Problems to DE
Problems to DA/DS
Data owner, data discovery, view metadata of data, data lineage
Problems in Financial industry
Data management, data owner, Authorization
Future Work
Thanks for listening! Q&A
Intellectual Property Rights
The rights and the intellectual property rights (including but not limited to the copyrights, patents and trademarks, and etc.) of the Material belongs to E.SUN Financial Holding Co., Ltd. and its subsidiaries (hereinafter referred to as “E.SUN”). Any copy, reproduction, modification, upload, post, distribution, transmission, sale or illegal usage of the Material in any way shall be strictly prohibited without the prior written permission of E.SUN. Except as expressly provided herein, E.SUN does not, in providing this Material, grant any express or implied right to you under any patents, copyrights, trademarks, trade secret or any other intellectual property rights.
Datahub - Dcard
Damon
Comparison of Amundsen and Datahub
Admin
Tool | Datahub | Amundsen |
Monitoring | Yes | No |
Authorization | Metadata policies | On the roadmap |
Ingestion | Plugin based | ETL based |
Components Management | Medium | Easy |
Metadata Management | Medium | Easy |
Metadata Architectures | Stream + API: Push | Crawl Based |
Backend | Java | Python |
User
Tool | Datahub | Amundsen |
Discovery | Overview, Domains, Glossary Terms, Data Product | Tags |
Lineage | Table, Column | Table, Column |
Metadata | Schema and column version, Historical stats, Additional properties | Current stats, Table and dashboard relationships |
Features for workflows
Github actions
This use case provides ideas for warning about target changes and showing their effects.
Glossary terms sync action
This use case provides ideas for monitoring the updates that target changes make necessary.
Datahub action framework
Monitoring data pipelines
Pipelines Monitoring - Dcard
Damon
Monitoring metrics
Airflow monitoring
Architecture for Airflow monitoring
Send P0 notification when the Airflow cluster…
Send P1 notification when the Airflow cluster…
Raw data monitoring
Architecture for raw data monitoring
GE UI
Runs overview
GE UI
The run overview
GE UI
The suite information
Testing
Talk to the stakeholders
Failure task monitoring
Data detail of failure task
Airflow Meetup in October
https://www.meetup.com/taipei-py/events/295948520/?isFirstPublish=true
🏠 Venue: TBD
📆 Time: 2023/10/5 (Thu), doors open 19:15; 19:30 - 21:30
📌 Topics:
Taiwan Airflow Meetup
2023.07.06 @ Dcard Headquarters
Taiwan Airflow Meetup
2023.06.01 @ Dcard Headquarters
📣 The COSCUP PyCon community track is calling for proposals!
Taiwan Airflow Meetup
2023.05.11 @ Dcard Headquarters
Brute-Force Data Processing
Alex Hsieh (DouEnergy)
Airflow Taiwan User Meetup #4, 2023 May
What is DuckDB?
"If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating"
Hadley Wickham (Chief Scientist at RStudio)
Bro! I dot car 🚗 any new database 😵💫
Source: Peter Boncz
Why DuckDB?
Easy
Fast Ritchie Vink (polars)
Fast duckdblab
Open
How to test DAGs?
Tai-Wei Huang
Taiwan Airflow Meetup
2023.04.13 @ Dcard Headquarters
Survey Workflow Management Framework
Tai-Wei Huang
Candidates
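Note: the comparisons below are based on each tool's latest version
| | | |
Released in 2011 and written in Java; a long-established open-source automation tool, now used mainly for CI/CD; the current stable version is 2.387.2 LTS | Released in 2015; a scheduling and workflow management tool written in Python, now used mainly for data processing, ETL, ML training, and similar work; the current version is v2.5.3 | Released in 2018; a scheduling and workflow management tool written in Python, now used mainly for data processing, ETL, ML training, and similar work; the current version is v2.10.2 | Released in 2018; a container-native workflow engine that orchestrates jobs running on Kubernetes; a general-purpose workflow tool; currently at v3.4.6 |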
Workflow Definition Language
Calendar
Backfill
Event trigger
Context in Workflow
Dynamic Workflow
Operators
Data Transfer Between Jobs
Job Queue Priority
Installation
Kubernetes
Maintenance - Scale
Maintenance - Operation
Maintenance - Debug
Maintenance - Upgrade
Monitoring
Alerting
Dependency Management
## pip-tools
https://github.com/jazzband/pip-tools#workflow-for-layered-requirements
production example
# requirements.in
airflow==2.0.2
$ pip-compile  # produces requirements.txt
# requirements.txt
airflow==2.0.2
    # via -r requirements.in
pandas==1.5
    # via airflow
local example
# dev-requirements.in
-c requirements.txt
pytest
$ pip-compile dev-requirements.in  # produces dev-requirements.txt
# dev-requirements.txt
pytest==7.1.2
install
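The layered setup relies on `-c requirements.txt` so that the dev pins never disagree with the production pins. A minimal sketch of that consistency check in Python, assuming the compiled files contain plain `name==version` lines; `parse_pins` and `find_conflicts` are hypothetical helper names for illustration and are not part of pip-tools.

```python
from __future__ import annotations


def parse_pins(text: str) -> dict[str, str]:
    """Return {package: version} for every `name==version` line."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments such as "# via ..."
        if "==" not in line or line.startswith("-"):
            continue  # skip blank lines and options like "-c requirements.txt"
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_conflicts(base: dict[str, str], layer: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Packages pinned in both files but at different versions."""
    return {
        name: (base[name], layer[name])
        for name in base.keys() & layer.keys()
        if base[name] != layer[name]
    }


# File contents copied from the slide; in practice they would be read from disk.
requirements_txt = """\
airflow==2.0.2
    # via -r requirements.in
pandas==1.5
    # via airflow
"""
dev_requirements_txt = """\
pytest==7.1.2
"""

base = parse_pins(requirements_txt)
dev = parse_pins(dev_requirements_txt)
conflicts = find_conflicts(base, dev)
print(conflicts)  # an empty dict means the two layers agree
```

A check like this could run in CI after `pip-compile`, so a drifting dev layer is caught before anyone installs it.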
2023/03/02
Sharing Our Airflow Upgrade Experience
Damon
Context
Workflow
Upgrade Plan
Testing Cluster
DAG Updates
DB Migration
DAG Testing
Server Testing
Staging Cluster
DAG Updates
DB Migration
DAG Testing
Server Testing
Prod Cluster
DAG Updates
DB Migration
DAG Testing
Server Testing
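The plan repeats the same four steps on each cluster, moving from testing to staging to prod. A minimal sketch of that ordering in Python; `run_step` is a hypothetical callback standing in for the real work, and stopping at the first failure is an assumption rather than something stated on the slide.

```python
from __future__ import annotations

from typing import Callable

CLUSTERS = ["testing", "staging", "prod"]
STEPS = ["DAG updates", "DB migration", "DAG testing", "server testing"]


def run_upgrade(run_step: Callable[[str, str], bool]) -> list[tuple[str, str]]:
    """Run every step on each cluster in order; stop at the first failure.

    Returns the (cluster, step) pairs that completed successfully.
    """
    completed = []
    for cluster in CLUSTERS:
        for step in STEPS:
            if not run_step(cluster, step):
                return completed  # never touch later clusters after a failure
            completed.append((cluster, step))
    return completed


def demo_step(cluster: str, step: str) -> bool:
    # Hypothetical outcome: pretend DAG testing fails on the staging cluster.
    return not (cluster == "staging" and step == "DAG testing")


done = run_upgrade(demo_step)
print(done[-1])  # last step that succeeded before the rollout stopped
```

Keeping the cluster order in one list makes it hard to reach prod by accident before staging has passed every step.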
Introduction to the workflow
Introduction to the new features
Notes after the upgrade
DB migration took a very long time
Symptom: The duration of the production DB migration was underestimated, which blocked all pipelines for 5 hours.
Solution: Consider a zero-downtime DB migration.
Runtime errors in report DAGs
Symptom: The DAGs raised runtime errors after the upgrade because the new pandas version changed a data type.
Solution: Run the tasks in Python virtual environments.
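Per-task virtual environments keep a cluster upgrade from silently swapping a library underneath a task. A smaller, complementary safeguard (a different technique from the virtual environment approach itself) is to fail fast when an installed library does not match the version the task was written for. A minimal sketch, assuming exact version pins; `check_pins` and `fake_lookup` are hypothetical helpers, and the pandas versions are placeholders rather than the versions from the incident.

```python
from importlib import metadata
from typing import Callable


def check_pins(pins: dict, get_version: Callable[[str], str] = metadata.version) -> list:
    """Return human-readable mismatches between expected and installed versions."""
    problems = []
    for package, expected in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed (expected {expected})")
            continue
        if installed != expected:
            problems.append(f"{package}: installed {installed}, expected {expected}")
    return problems


def fake_lookup(package: str) -> str:
    # Stand-in for the real environment so the example runs anywhere.
    installed = {"pandas": "2.0.0"}
    if package not in installed:
        raise metadata.PackageNotFoundError(package)
    return installed[package]


print(check_pins({"pandas": "1.5.3"}, get_version=fake_lookup))
```

Called at the top of a task with the default `get_version`, the function reads the versions actually installed in the worker, so a mismatch surfaces as a clear message instead of a data type error deep inside the task.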
Testing Cluster
Prod Cluster
Staging Cluster
Improvement ideas
2023/02/02
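- Discuss the future format
- Long-term event organizer
- A sponsored venue, or a cheap one?
- Share everyone's hands-on experience
Sharing
Everyone is welcome to give a short talk using one or two slides
Venue candidates?
Digital Transformation Research Institute, Institute for Information Industry
3F, No. 51, Sec. 2, Chongqing S. Rd. (next to MRT Chiang Kai-Shek Memorial Hall Station)
Their office has a meeting room that holds about 20-30 people
Offered by E.SUN
A headhunting firm, nichebridge
根聚地
Offered by Dcard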
How to test Redshift
Redshift is AWS's data warehouse solution
But it cannot run locally, so how do you unit test it?
Here is a package that lets you test it:
https://pypi.org/project/pytest-mock-resources/
pytest with testcontainers