
Building an enterprise-level data lake based on Flink+Iceberg

openinx@apache.org


Overview

  • What was our motivation (since 2020)?

  • What are the data lake scenarios from Flink's perspective?

  • Why did we choose Apache Iceberg (since 2019)?

  • What's the roadmap of our Flink + Iceberg data lake?


What was our motivation (since 2020)?

  • The data lake is a popular and flexible big data architecture.
    • Snowflake
    • AWS/Azure data lake products
    • Databricks

  • Apache Flink has become a popular unified batch & stream data processing framework.
    • Apache Flink is one of the most active open source projects in the Apache community.
    • Its unified stream & batch processing ability has proven to be high-value, stable, and deployed at large scale in Alibaba's Tmall services.

  • The data processing framework is the pipeline, but data is the oil.
    • How can we deliver the best data value to our Flink users?
    • Build our own data lake for Flink.


What are the data lake scenarios from Flink's perspective?

Scenario 1: Building a near-real-time data pipeline

[Diagram: Raw Table → Refined Table → Aggregate Table]
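The Raw → Refined → Aggregate pipeline above can be sketched as chained transformations. This is a plain-Python illustration of the layering, not actual Flink code; in production each stage would be a Flink job writing an Iceberg table, and the field names and cleaning rules here are hypothetical:

```python
# Near-real-time pipeline sketch: Raw -> Refined -> Aggregate.
# Plain functions stand in for the three table layers (names hypothetical).

def refine(raw_rows):
    """Raw -> Refined: drop malformed rows, normalize fields."""
    return [
        {"user": r["user"].strip().lower(), "amount": float(r["amount"])}
        for r in raw_rows
        if r.get("user") and r.get("amount") is not None
    ]

def aggregate(refined_rows):
    """Refined -> Aggregate: total amount per user."""
    totals = {}
    for r in refined_rows:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals

raw = [
    {"user": " Alice ", "amount": "3.5"},
    {"user": "bob", "amount": "1.0"},
    {"user": "", "amount": "9.9"},   # malformed: dropped at the Refined stage
    {"user": "alice", "amount": "0.5"},
]
print(aggregate(refine(raw)))  # {'alice': 4.0, 'bob': 1.0}
```

Each stage's output is persisted, so downstream consumers can read the Refined or Aggregate table at their own cadence instead of reprocessing raw events.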


What are the data lake scenarios from Flink's perspective?

Scenario 2: Real-time CDC data ingestion and export
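CDC ingestion boils down to applying a changelog of insert/update/delete events to a keyed table, which is what Iceberg's format v2 row-level deletes enable. A minimal sketch of those semantics in plain Python (the event tuples are a hypothetical shape, not the actual Flink CDC wire format):

```python
# Apply a CDC changelog (+I insert, +U upsert, -D delete) to a keyed table.
# Iceberg format v2 implements this with row-level/equality deletes; here a
# dict keyed by primary key stands in for the table (event shape hypothetical).

def apply_changelog(table, events):
    for op, key, row in events:
        if op in ("+I", "+U"):   # insert or upsert: the latest row wins
            table[key] = row
        elif op == "-D":         # delete: remove the row for this key
            table.pop(key, None)
    return table

events = [
    ("+I", 1, {"id": 1, "city": "Beijing"}),
    ("+I", 2, {"id": 2, "city": "Hangzhou"}),
    ("+U", 1, {"id": 1, "city": "Shanghai"}),  # update row with id=1
    ("-D", 2, None),                           # delete row with id=2
]
print(apply_changelog({}, events))  # {1: {'id': 1, 'city': 'Shanghai'}}
```

The export direction is symmetric: reading the table's incremental snapshots back out as a changelog for downstream consumers.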


What are the data lake scenarios from Flink's perspective?

Scenario 3: Unified stream & batch processing for near-real-time scenarios (1)

[Diagram: Change log / Sensor data → Streaming Analysis → Data Scientist / BI Users]


What are the data lake scenarios from Flink's perspective?

Scenario 3: Unified stream & batch processing for near-real-time scenarios (2)

[Diagram: Change log / Sensor data → Streaming Analysis → Data Scientist / BI Users]


What are the data lake scenarios from Flink's perspective?

Scenario 5: Correcting real-time aggregation results with Iceberg data

  • Kafka stores the latest published records; Iceberg stores all the historical records.
  • Flink writes those records into Apache Iceberg.
  • Aggregated streaming records are written to a key-value database.
  • Real-time aggregation results are corrected using Iceberg's historical data.
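The correction pattern above can be sketched as follows: approximate per-key aggregates produced by the streaming path are overwritten by exact aggregates recomputed from the historical store. All names here are hypothetical; a real deployment would recount from Iceberg snapshots and write back to the key-value database:

```python
# Correction sketch: the streaming path keeps approximate per-key counts in
# a KV store; a periodic batch job recomputes exact counts from historical
# data (Iceberg in this talk) and overwrites the stale keys.

def correct(kv_store, historical_rows):
    exact = {}
    for row in historical_rows:      # exact recount from the full history
        exact[row["key"]] = exact.get(row["key"], 0) + 1
    kv_store.update(exact)           # batch result overrides streaming result
    return kv_store

# Streaming counts drifted (e.g. duplicates after a failure/restart).
kv = {"a": 5, "b": 2}
history = [{"key": "a"}, {"key": "a"}, {"key": "a"}, {"key": "b"}, {"key": "b"}]
print(correct(kv, history))  # {'a': 3, 'b': 2}
```

Because Iceberg keeps the complete record history, the batch recount is authoritative, while Kafka plus the streaming job keeps serving low-latency (if occasionally inexact) results between corrections.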


Why did we choose Apache Iceberg (since 2019)?

Comparison of Delta, Iceberg, and Hudi

| Items                        | Open Source Delta      | Apache Iceberg          | Apache Hudi             |
| ---------------------------- | ---------------------- | ----------------------- | ----------------------- |
| Open Source Time             | 2019/04/12             | 2018/11/06 (incubation) | 2019/01/17 (incubation) |
| GitHub Stars                 | 3.2K                   | 1.3K                    | 1.8K                    |
| ACID                         | Yes                    | Yes                     | Yes                     |
| Isolation Level              | Snapshot serialization | Snapshot serialization  | Write serialization     |
| Time Travel                  | Yes                    | Yes                     | Yes                     |
| Row-level DELETE (batch)     | Yes                    | Ongoing                 | No                      |
| Row-level DELETE (streaming) | No                     | Ongoing                 | Yes                     |
| Abstracted Schema            | No                     | Yes                     | No                      |
| Engine Pluggable             | No                     | Yes                     | No                      |
| Open File Format             | Yes                    | Yes                     | Yes (data) + No (log)   |
| Filter Push Down             | No                     | Yes                     | No                      |
| Auto-Compaction              | No                     | Ongoing                 | Yes                     |
| Python Support               | Yes                    | Yes                     | No                      |
| File Encryption              | No                     | Yes                     | No                      |


Flink + Iceberg Roadmap

Phase #1 (Connect to Iceberg) - Apache Flink 1.11.0, Apache Iceberg 0.10.0 (Oct 2020)
  • Work items:
    1. Flink streaming sink
    2. Flink batch sink
    3. Flink batch source
  • Powered by:
    1. Tencent
    2. Netflix (Flink + Iceberg on AWS S3)
    3. Yilong.com (~100 Iceberg tables)
    4. autohome.com (replacing Hive tables)
    5. Adobe / LinkedIn / Apple

Phase #2 (Replace Hive table format) - Apache Flink 1.11.0, Apache Iceberg 0.11.0 (Jan 2021)
  • Work items:
    1. Flink source improvements - filter/limit push down
    2. Flink streaming source
    3. Format v2: CDC/Upsert (phase #1) - write & read correct data
    4. Major compaction (batch mode)
  • Powered by:
    1. autohome.com - CDC/Upsert POC

Phase #3 (Batch/stream row-level delete) - Apache Flink 1.12.0, Apache Iceberg 0.12.0 (~ Apr 2021)
  • Work items:
    1. Format v2: CDC/Upsert (phase #2) - performance & stability
    2. Flink SQL imports CDC into Iceberg

Phase #4 (More powerful data lake)
  • Apache Flink 1.13.0 (?):
    1. SQL DDL
    2. SQL time travel
  • Apache Iceberg 0.13.0 (?):
    1. Integrate Ranger & Atlas
    2. Integrate with Alluxio
    3. SQL on everything (snapshots/manifests/partitions)
    4. More things