CDC Streams into the Lakehouse with Apache Hudi

Sagar Sumit

Apache Hudi/Onehouse

Speaker Bio

Sagar Sumit

  • Software Engineer @ Onehouse
  • Committer @ Apache Hudi
  • Software Engineer @ AWS (Amazon Aurora)
  • Member of Technical Staff @ Oracle (Oracle GoldenGate)

Agenda

  1. Lakehouse Architecture
  2. Challenges with Streaming Data on Data Lake
  3. Solutions with Flink and Kafka on Hudi
  4. Hudi as the Lakehouse Platform

Lakehouse Architecture

Origins @ Uber, 2016

Context

  • Uber in hypergrowth
  • Moving from warehouse to lake
  • HDFS/Cloud storage is immutable

Problems

  • Extremely poor ingest performance
  • Wasteful reading/writing
  • Zero concurrency control or ACID

Hudi Data Lake

  • Database abstraction for cloud storage/HDFS
  • Near real-time ingestion using ACID updates
  • Incremental, efficient ETL downstream
  • Built-in table services
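Since the underlying files are immutable, an ACID update means writing a new version of the affected file rather than mutating it in place. A minimal Python sketch of the copy-on-write idea (illustrative only, not Hudi's actual implementation):

```python
# Illustrative sketch: a copy-on-write upsert over immutable "files",
# modeled here as immutable tuples of records. The old file version is
# never touched; a new file slice is written with the updates merged in.

def cow_upsert(base_file, updates):
    """Merge updates (by record key) into an immutable file by writing
    a new version of the file; the old version stays readable."""
    by_key = {rec["key"]: rec for rec in base_file}
    by_key.update({rec["key"]: rec for rec in updates})
    return tuple(by_key[k] for k in sorted(by_key))  # new immutable slice

v1 = (
    {"key": "a", "val": 1},
    {"key": "b", "val": 2},
)
v2 = cow_upsert(v1, [{"key": "b", "val": 20}, {"key": "c", "val": 3}])
```

Keeping both versions around is also what makes time-travel and incremental queries possible later.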

Need for Lakehouse

Reliable data management on data lake

  • 2-tier data architectures with a separate lake and warehouse add extra complexity
  • Freshness, consistency and isolation

Support for ML/DS on top of lake

  • Structured vs Unstructured data
  • ML/DS applications suffer from the same data management problems as classical applications

Performance

  • Storage layout optimization
  • Efficiently locate records

Challenges with Streaming on Data Lake

Changelog Stream Ingestion

  • Files on object storage are immutable
  • Numerous small files
  • Manual compaction and clustering
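Automatic file sizing is what lifts the manual compaction/clustering burden: a table service plans rewrites that bin-pack small files toward a target size. A toy sketch of such a plan (sizes and target are arbitrary; not Hudi's actual algorithm):

```python
# Illustrative sketch: bin-pack many small files into groups close to a
# target size, the way a clustering/compaction service would plan
# rewrites to fix the small-file problem.

def plan_file_groups(file_sizes, target_size):
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):  # largest first
        if current and current_size + size > target_size:
            groups.append(current)                 # close the full group
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

plan = plan_file_groups([10, 20, 30, 40, 50, 60], target_size=100)
```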

Materialization

  • The dynamic table is not queryable
  • Nor can it be shared among jobs
  • No history (time-travel) queries
  • No schema evolution

Incremental ETL

  • Preserve the event sequence for the same key
  • Exactly-once semantics
  • Track the consumption offset
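Exactly-once delivery on restart hinges on committing the consumption offset atomically with the state it produced. A minimal sketch of the idea (list positions stand in for stream offsets; not any real connector API):

```python
# Illustrative sketch: resume consumption from a committed offset so
# each record is applied exactly once even after a restart. Offset and
# state are committed together (here, in one checkpoint dict), which is
# the essence of an exactly-once sink.

def consume(stream, checkpoint):
    state, offset = dict(checkpoint["state"]), checkpoint["offset"]
    for pos, (key, val) in enumerate(stream):
        if pos < offset:          # already applied before the restart
            continue
        state[key] = val
        offset = pos + 1
    return {"state": state, "offset": offset}

stream = [("a", 1), ("b", 2), ("c", 3)]
cp0 = {"state": {}, "offset": 0}
cp1 = consume(stream[:2], cp0)    # first run processes two records
cp2 = consume(stream, cp1)        # restart replays the stream safely
```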

Solutions with Flink and Kafka on Hudi

Record Level Index

  • Pluggable index layer
  • File layout to manage updates and small files
  • Multi-version datasets tagged with a version ID
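A record-level index lets the writer tag each incoming record as an update (routed to the file group that already holds its key) or an insert (assigned to a new file group). A toy sketch of that tagging step (file-group names are made up):

```python
# Illustrative sketch: a record-level index maps each record key to the
# file group that holds it, so incoming records can be tagged as
# updates (existing file group) or inserts (a new file group).

def tag_records(index, records, new_group):
    tagged = []
    for rec in records:
        group = index.get(rec["key"])
        if group is None:
            group = new_group
            index[rec["key"]] = group  # later writes see it as an update
        tagged.append((rec["key"], group))
    return tagged

index = {"a": "fg-1", "b": "fg-1"}
tags = tag_records(index, [{"key": "a"}, {"key": "c"}], new_group="fg-2")
```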

Record Level Index

  • Record-level index with a KV state backend
  • Exactly-once semantics with checkpoints
  • Automatic table services: cleaning/compaction

Materialization

  • Preserve all the change operations
  • Preserve the real event-time sequence for each key
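Preserving the event sequence for a key amounts to resolving out-of-order arrivals by an ordering field: per key, keep the record with the greatest event time (akin to Hudi's precombine field). A toy sketch:

```python
# Illustrative sketch: resolve out-of-order arrivals per key by event
# time, so a late record never overwrites a newer one.

def merge_by_event_time(records):
    latest = {}
    for rec in records:
        key, ts = rec["key"], rec["ts"]
        if key not in latest or ts >= latest[key]["ts"]:
            latest[key] = rec
    return latest

out_of_order = [
    {"key": "a", "ts": 2, "val": "new"},
    {"key": "a", "ts": 1, "val": "late"},  # late arrival must not win
    {"key": "b", "ts": 1, "val": "only"},
]
merged = merge_by_event_time(out_of_order)
```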

Incremental ETL

  • Monitor the incremental dataset for new instants
  • The timeline and filesystem view guard the visibility of records
  • Supports a specific start offset
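Incremental consumption boils down to remembering the last instant read from the timeline and fetching only records committed after it; a specific start offset is just a chosen starting instant. A toy sketch:

```python
# Illustrative sketch: incremental read over a commit timeline. The
# reader keeps the last instant it consumed and fetches only records
# committed after it on each poll.

def incremental_read(timeline, since_instant):
    """timeline: list of (instant, records) pairs, ordered by instant."""
    new_records, last = [], since_instant
    for instant, records in timeline:
        if instant > since_instant:
            new_records.extend(records)
            last = instant
    return new_records, last

timeline = [("001", ["r1", "r2"]), ("002", ["r3"]), ("003", ["r4"])]
batch, checkpoint = incremental_read(timeline, since_instant="001")
```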

Incremental ETL

  • Medallion architecture in end-to-end streaming style
  • The Hudi table is the carrier that persists the dynamic table and propagates changes downstream

Hudi as a Lakehouse Platform

Lake Storage (cloud object stores, HDFS, …)

Open File/Data Formats (Parquet, HFile, Avro, ORC, …)

Transactional Database Layer:

  • Table Format (schema, file listings, stats, evolution, …)
  • Indexes (bloom filter, HBase, bucket index, hash based, Lucene, …)
  • Concurrency Control (OCC, MVCC, non-blocking, lock providers, orchestration, scheduling, …)
  • Table Services (cleaning, compaction, clustering, indexing, file sizing, …)
  • Lake Cache (columnar, transactional, mutable, WIP, …)
  • Metaserver (stats, table service coordination, …)

Execution/Runtimes:

  • SQL Query Engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, …)

Platform Services (streaming/batch ingest, various sources, catalog sync, admin CLI, data quality, …)

Hudi Table Format

File Group Structure for a MOR table
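In a Merge-On-Read file group, a columnar base file is paired with log files that accumulate later updates and deletes; a snapshot reader (or the compactor) merges them by record key. A toy sketch of that merge:

```python
# Illustrative sketch: a MOR file group = one base file plus log files
# of later changes. A snapshot read merges them by record key to
# produce the latest view.

def snapshot_read(base_file, log_files):
    merged = {rec["key"]: rec for rec in base_file}
    for log in log_files:              # logs applied in arrival order
        for rec in log:
            if rec.get("deleted"):
                merged.pop(rec["key"], None)
            else:
                merged[rec["key"]] = rec
    return [merged[k] for k in sorted(merged)]

base = [{"key": "a", "val": 1}, {"key": "b", "val": 2}]
logs = [[{"key": "a", "val": 10}], [{"key": "b", "deleted": True}]]
snapshot = snapshot_read(base, logs)
```

Writes stay cheap (append to a log); compaction later folds the logs into a new base file so reads stay fast.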

Multi-modal indexing sub-system

Scalable metadata table

  • Internal MoR table
  • Different partitions store different stats and indexes

Many types of indexes

  • Files, column stats, bloom filters, record-level index, secondary indexes, etc.

Async Indexer

  • Concurrently build index partitions
  • Zero-downtime operation
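Conceptually, each metadata-table partition serves a different lookup; the sketch below shows column stats being used to prune files for a query (partition names and layout are made up for illustration, not Hudi's actual schema):

```python
# Illustrative sketch: a multi-modal index kept as partitions of one
# internal metadata table, each serving a different lookup. Partition
# names and layout here are invented for illustration.

metadata_table = {
    "files":        {"2023/01/01": ["fg1-base.parquet", "fg1.log"]},
    "column_stats": {("fg1-base.parquet", "price"): {"min": 1, "max": 99}},
    "record_index": {"key-42": "fg1"},
}

def prune_files(metadata_table, column, predicate):
    """Keep only files whose column stats can satisfy the predicate."""
    return [f for (f, col), stats in metadata_table["column_stats"].items()
            if col == column and predicate(stats)]

hits = prune_files(metadata_table, "price", lambda s: s["max"] >= 50)
```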

Hudi Tech Stack on Cloud

Roadmap

  • First-class support for CDC data
    • Incremental queries
  • Record-level index
    • Global index
    • Performs better for random updates
  • New table + merge APIs
    • Easier reader/writer integrations
    • Engine-specific merge implementations
  • First-class support for unstructured data
    • JSON blobs, images, videos
    • Same capabilities around indexing and capturing changes
  • https://github.com/apache/hudi/pull/8679

Resources

The Community

  • 3000+ Slack members
  • 300+ contributors
  • 3000+ GH engagers
  • 30+ committers
  • Pre-installed on 5 cloud providers
  • Diverse PMC/committers
  • 1M downloads/month (400% YoY)
  • 800B+ records/day (from even just one customer!)

Rich community of participants

Engage With Our Community

Join Hudi Slack

Thanks!

Questions?