Building an enterprise-level data lake based on Flink+Iceberg
openinx@apache.org
Overview
What was the motivation (since 2020)?
Raw Table
Refined Table
Aggregate Table
Scenario 1: Building a near-real-time data pipeline
What are the data lake scenarios from Flink's perspective?
Scenario 2: Real-time ingestion and export of CDC data
Change log
Sensor data
Streaming Analysis
Data Scientist
BI Users
Scenario 3: Unified stream and batch processing in near-real-time scenarios (1)
Change log
Sensor data
Streaming Analysis
Data Scientist
BI Users
Scenario 3: Unified stream and batch processing in near-real-time scenarios (2)
Flink writes those records into Apache Iceberg.
Aggregated streaming records are written to a key-value database.
Real-time aggregation results are corrected using Iceberg's historical data.
Scenario 5: Correcting real-time aggregation results with Iceberg data
Kafka stores the latest published records
Iceberg stores all the historical records
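The correction idea in this scenario can be illustrated with a minimal, library-free sketch. It assumes the real-time path missed one late record, and that a periodic batch job recomputes the aggregate from the full history and overwrites any value that has drifted. The names `aggregate`, `correct_aggregates`, `kv_store`, and the sample sensor data are hypothetical and do not come from Flink or Iceberg; plain Python structures stand in for the Iceberg table and the key-value database.

```python
from collections import defaultdict


def aggregate(records):
    """Sum values per key; a stand-in for the streaming aggregation job."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


def correct_aggregates(realtime_store, historical_records):
    """Recompute aggregates from the full history and overwrite drifted values.

    Returns only the entries that had to be corrected.
    """
    recomputed = aggregate(historical_records)
    corrections = {
        key: value
        for key, value in recomputed.items()
        if realtime_store.get(key) != value
    }
    realtime_store.update(corrections)
    return corrections


# Hypothetical full history, as it would be kept in the historical table.
historical = [("sensor-a", 3), ("sensor-a", 4), ("sensor-b", 5), ("sensor-b", 1)]

# Assumption: the real-time path missed the late record ("sensor-b", 1).
kv_store = aggregate(historical[:3])

fixed = correct_aggregates(kv_store, historical)
```

In this sketch only the drifted key is rewritten, so the key-value store receives a small correction rather than a full reload.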
Comparison of Delta, Hudi, and Iceberg
Items | Open Source Delta | Apache Iceberg | Apache Hudi |
Open-sourced on | 2019/04/12 | 2018/11/06 (incubation) | 2019/01/17 (incubation) |
GitHub Stars | 3.2K | 1.3K | 1.8K |
ACID | Yes | Yes | Yes |
Isolation Level | Snapshot serialization | Snapshot serialization | Write serialization |
Time Travel | Yes | Yes | Yes |
Row-level DELETE (batch) | Yes | Ongoing | No |
Row-level DELETE (streaming) | No | Ongoing | Yes |
Abstracted Schema | No | Yes | No |
Engine Pluggable | No | Yes | No |
Open File Format | Yes | Yes | Yes (Data) + No (Log) |
Filter push down | No | Yes | No |
Auto-Compaction | No | Ongoing | Yes |
Python Support | Yes | Yes | No |
File Encryption | No | Yes | No |
Why did we choose Apache Iceberg (since 2019)?
Phase | Apache Flink | Apache Iceberg | Powered by |
Phase #1 (Connect to Iceberg) | Apache Flink 1.11.0 | Apache Iceberg 0.10.0 (Oct 2020) | |
Phase #2 (Replace Hive table format) | Apache Flink 1.11.0 | Apache Iceberg 0.11.0 (Jan 2021) | |
Phase #3 (Batch/Stream row-level delete) | Apache Flink 1.12.0 | Apache Iceberg 0.12.0 (~ Apr 2021) | |
Phase #4 (More powerful data lake) | Apache Flink 1.13.0 (?) | Apache Iceberg 0.13.0 (?) | |
Flink + Iceberg Roadmap