
Building an enterprise-level data lake based on Flink+Iceberg

openinx@apache.org


Overview

  • What was our motivation (since 2020)?

  • What are the data lake scenarios from Flink's perspective?

  • Why did we choose Apache Iceberg (since 2019)?

  • What's the roadmap of our Flink + Iceberg data lake?


What was our motivation (since 2020)?

  • The data lake is a popular and flexible big data architecture.
    • Snowflake
    • AWS/Azure data lake products
    • Databricks

  • Apache Flink has become a popular unified batch & stream data processing framework.
    • Apache Flink is one of the most active open source projects in the Apache community.
    • Its unified stream & batch processing ability has proven to be high-value, stable, and deployed at large scale in Alibaba's Tmall services.

  • The data processing framework is the pipeline, but data is the oil.
    • How can we deliver the best data value to our Flink users?
    • Build our own data lake for Flink.


What are the data lake scenarios from Flink's perspective?

Scenario 1: Building a near-real-time data pipeline

[Diagram: Raw Table → Refined Table → Aggregate Table]
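The Raw → Refined → Aggregate pipeline above can be sketched as chained transformations. This is a plain-Python illustration of the layering, not actual Flink code; in production each stage would be a Flink job writing an Iceberg table, and the field names and cleaning rules here are hypothetical:

```python
# Near-real-time pipeline sketch: Raw -> Refined -> Aggregate.
# Plain functions stand in for the three table layers (names hypothetical).

def refine(raw_rows):
    """Raw -> Refined: drop malformed rows, normalize fields."""
    return [
        {"user": r["user"].strip().lower(), "amount": float(r["amount"])}
        for r in raw_rows
        if r.get("user") and r.get("amount") is not None
    ]

def aggregate(refined_rows):
    """Refined -> Aggregate: total amount per user."""
    totals = {}
    for r in refined_rows:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals

raw = [
    {"user": " Alice ", "amount": "3.5"},
    {"user": "bob", "amount": "1.0"},
    {"user": "", "amount": "9.9"},   # malformed: dropped at the Refined stage
    {"user": "alice", "amount": "0.5"},
]
print(aggregate(refine(raw)))  # {'alice': 4.0, 'bob': 1.0}
```

Each stage's output is persisted, so downstream consumers can read the Refined or Aggregate table at their own cadence instead of reprocessing raw events.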


What are the data lake scenarios from Flink's perspective?

Scenario 2: Real-time CDC data ingestion and export
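CDC ingestion boils down to applying a changelog of insert/update/delete events to a keyed table, which is what Iceberg's format v2 row-level deletes enable. A minimal sketch of those semantics in plain Python (the event tuples are a hypothetical shape, not the actual Flink CDC wire format):

```python
# Apply a CDC changelog (+I insert, +U upsert, -D delete) to a keyed table.
# Iceberg format v2 implements this with row-level/equality deletes; here a
# dict keyed by primary key stands in for the table (event shape hypothetical).

def apply_changelog(table, events):
    for op, key, row in events:
        if op in ("+I", "+U"):   # insert or upsert: the latest row wins
            table[key] = row
        elif op == "-D":         # delete: remove the row for this key
            table.pop(key, None)
    return table

events = [
    ("+I", 1, {"id": 1, "city": "Beijing"}),
    ("+I", 2, {"id": 2, "city": "Hangzhou"}),
    ("+U", 1, {"id": 1, "city": "Shanghai"}),  # update row with id=1
    ("-D", 2, None),                           # delete row with id=2
]
print(apply_changelog({}, events))  # {1: {'id': 1, 'city': 'Shanghai'}}
```

The export direction is symmetric: reading the table's incremental snapshots back out as a changelog for downstream consumers.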


What are the data lake scenarios from Flink's perspective?

Scenario 3: Unified stream & batch processing for near-real-time scenarios (1)

[Diagram: Change log / Sensor data → Streaming Analysis → Data Scientist / BI Users]


What are the data lake scenarios from Flink's perspective?

Scenario 3: Unified stream & batch processing for near-real-time scenarios (2)

[Diagram: Change log / Sensor data → Streaming Analysis → Data Scientist / BI Users]


What are the data lake scenarios from Flink's perspective?

Scenario 5: Correcting real-time aggregation results with Iceberg data

  • Kafka stores the latest published records; Iceberg stores all the historical records.
  • Flink writes those records into Apache Iceberg.
  • Aggregated streaming records are written to a key-value database.
  • Real-time aggregation results are corrected using Iceberg's historical data.
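The correction pattern above can be sketched as follows: approximate per-key aggregates produced by the streaming path are overwritten by exact aggregates recomputed from the historical store. All names here are hypothetical; a real deployment would recount from Iceberg snapshots and write back to the key-value database:

```python
# Correction sketch: the streaming path keeps approximate per-key counts in
# a KV store; a periodic batch job recomputes exact counts from historical
# data (Iceberg in this talk) and overwrites the stale keys.

def correct(kv_store, historical_rows):
    exact = {}
    for row in historical_rows:      # exact recount from the full history
        exact[row["key"]] = exact.get(row["key"], 0) + 1
    kv_store.update(exact)           # batch result overrides streaming result
    return kv_store

# Streaming counts drifted (e.g. duplicates after a failure/restart).
kv = {"a": 5, "b": 2}
history = [{"key": "a"}, {"key": "a"}, {"key": "a"}, {"key": "b"}, {"key": "b"}]
print(correct(kv, history))  # {'a': 3, 'b': 2}
```

Because Iceberg keeps the complete record history, the batch recount is authoritative, while Kafka plus the streaming job keeps serving low-latency (if occasionally inexact) results between corrections.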


Why did we choose Apache Iceberg (since 2019)?

Comparison of Delta, Iceberg, and Hudi

| Items                        | Open Source Delta      | Apache Iceberg          | Apache Hudi             |
| ---------------------------- | ---------------------- | ----------------------- | ----------------------- |
| Open Source Time             | 2019/04/12             | 2018/11/06 (incubation) | 2019/01/17 (incubation) |
| GitHub Stars                 | 3.2K                   | 1.3K                    | 1.8K                    |
| ACID                         | Yes                    | Yes                     | Yes                     |
| Isolation Level              | Snapshot serialization | Snapshot serialization  | Write serialization     |
| Time Travel                  | Yes                    | Yes                     | Yes                     |
| Row-level DELETE (batch)     | Yes                    | Ongoing                 | No                      |
| Row-level DELETE (streaming) | No                     | Ongoing                 | Yes                     |
| Abstracted Schema            | No                     | Yes                     | No                      |
| Engine Pluggable             | No                     | Yes                     | No                      |
| Open File Format             | Yes                    | Yes                     | Yes (data) + No (log)   |
| Filter Push Down             | No                     | Yes                     | No                      |
| Auto-Compaction              | No                     | Ongoing                 | Yes                     |
| Python Support               | Yes                    | Yes                     | No                      |
| File Encryption              | No                     | Yes                     | No                      |


Flink + Iceberg Roadmap

Phase #1 (Connect to Iceberg) - Apache Flink 1.11.0, Apache Iceberg 0.10.0 (Oct 2020)
  • Work items:
    1. Flink streaming sink
    2. Flink batch sink
    3. Flink batch source
  • Powered by:
    1. Tencent
    2. Netflix (Flink + Iceberg on AWS S3)
    3. Yilong.com (~100 Iceberg tables)
    4. autohome.com (replacing Hive tables)
    5. Adobe / LinkedIn / Apple

Phase #2 (Replace Hive table format) - Apache Flink 1.11.0, Apache Iceberg 0.11.0 (Jan 2021)
  • Work items:
    1. Flink source improvements - filter/limit push down
    2. Flink streaming source
    3. Format v2: CDC/Upsert (phase #1) - write & read correct data
    4. Major compaction (batch mode)
  • Powered by:
    1. autohome.com - CDC/Upsert POC

Phase #3 (Batch/stream row-level delete) - Apache Flink 1.12.0, Apache Iceberg 0.12.0 (~ Apr 2021)
  • Work items:
    1. Format v2: CDC/Upsert (phase #2) - performance & stability
    2. Flink SQL imports CDC into Iceberg

Phase #4 (More powerful data lake)
  • Apache Flink 1.13.0 (?):
    1. SQL DDL
    2. SQL time travel
  • Apache Iceberg 0.13.0 (?):
    1. Integrate Ranger & Atlas
    2. Integrate with Alluxio
    3. SQL on everything (snapshots/manifests/partitions)
    4. More things