CDC Streams into the Lakehouse with Apache Hudi

Sagar Sumit

Apache Hudi/Onehouse

Speaker Bio

Sagar Sumit

  • Software Engineer @ Onehouse
  • Committer @ Apache Hudi
  • Software Engineer @ AWS (Amazon Aurora)
  • Member of Technical Staff @ Oracle (Oracle GoldenGate)

Agenda

  1. Lakehouse Architecture
  2. Challenges with Streaming Data on Data Lake
  3. Solutions with Flink and Kafka on Hudi
  4. Hudi as the Lakehouse Platform

Lakehouse Architecture

Origins @ Uber, 2016

Context

  • Uber in hypergrowth
  • Moving from warehouse to lake
  • HDFS/Cloud storage is immutable

Problems

  • Extremely poor ingest performance
  • Wasteful reading/writing
  • Zero concurrency control or ACID

Hudi Data Lake

  • Database abstraction for cloud storage/HDFS
  • Near real-time ingestion using ACID updates
  • Incremental, efficient ETL downstream
  • Built-in table services
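Since the underlying files are immutable, an ACID update means writing a new version of the affected file rather than mutating it in place. A minimal Python sketch of the copy-on-write idea (illustrative only, not Hudi's actual implementation):

```python
# Illustrative sketch: a copy-on-write upsert over immutable "files",
# modeled here as immutable tuples of records. The old file version is
# never touched; a new file slice is written with the updates merged in.

def cow_upsert(base_file, updates):
    """Merge updates (by record key) into an immutable file by writing
    a new version of the file; the old version stays readable."""
    by_key = {rec["key"]: rec for rec in base_file}
    by_key.update({rec["key"]: rec for rec in updates})
    return tuple(by_key[k] for k in sorted(by_key))  # new immutable slice

v1 = (
    {"key": "a", "val": 1},
    {"key": "b", "val": 2},
)
v2 = cow_upsert(v1, [{"key": "b", "val": 20}, {"key": "c", "val": 3}])
```

Keeping both versions around is also what makes time-travel and incremental queries possible later.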

Need for Lakehouse

Reliable data management on data lake

  • 2-tier data architectures with a separate lake and warehouse add extra complexity
  • Freshness, consistency and isolation

Support for ML/DS on top of lake

  • Structured vs Unstructured data
  • ML/DS applications suffer from the same data management problems as classical applications

Performance

  • Storage layout optimization
  • Efficiently locate records

Challenges with Streaming on Data Lake

Changelog Stream Ingestion

  • Files on object storage are immutable
  • Numerous small files
  • Manual compaction and clustering
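Automatic file sizing is what lifts the manual compaction/clustering burden: a table service plans rewrites that bin-pack small files toward a target size. A toy sketch of such a plan (sizes and target are arbitrary; not Hudi's actual algorithm):

```python
# Illustrative sketch: bin-pack many small files into groups close to a
# target size, the way a clustering/compaction service would plan
# rewrites to fix the small-file problem.

def plan_file_groups(file_sizes, target_size):
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):  # largest first
        if current and current_size + size > target_size:
            groups.append(current)                 # close the full group
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

plan = plan_file_groups([10, 20, 30, 40, 50, 60], target_size=100)
```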

Materialization

  • The dynamic table is not queryable
  • Nor can it be shared among jobs
  • No history (time-travel) queries
  • No schema evolution

Incremental ETL

  • Preserve the event sequence for the same key
  • Exactly-once semantics
  • Track the consumption offset
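Exactly-once delivery on restart hinges on committing the consumption offset atomically with the state it produced. A minimal sketch of the idea (list positions stand in for stream offsets; not any real connector API):

```python
# Illustrative sketch: resume consumption from a committed offset so
# each record is applied exactly once even after a restart. Offset and
# state are committed together (here, in one checkpoint dict), which is
# the essence of an exactly-once sink.

def consume(stream, checkpoint):
    state, offset = dict(checkpoint["state"]), checkpoint["offset"]
    for pos, (key, val) in enumerate(stream):
        if pos < offset:          # already applied before the restart
            continue
        state[key] = val
        offset = pos + 1
    return {"state": state, "offset": offset}

stream = [("a", 1), ("b", 2), ("c", 3)]
cp0 = {"state": {}, "offset": 0}
cp1 = consume(stream[:2], cp0)    # first run processes two records
cp2 = consume(stream, cp1)        # restart replays the stream safely
```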

Solutions with Flink and Kafka on Hudi

Record Level Index

  • Pluggable index layer
  • File layout to manage updates and small files
  • Multi-version datasets tagged with a version ID
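A record-level index lets the writer tag each incoming record as an update (routed to the file group that already holds its key) or an insert (assigned to a new file group). A toy sketch of that tagging step (file-group names are made up):

```python
# Illustrative sketch: a record-level index maps each record key to the
# file group that holds it, so incoming records can be tagged as
# updates (existing file group) or inserts (a new file group).

def tag_records(index, records, new_group):
    tagged = []
    for rec in records:
        group = index.get(rec["key"])
        if group is None:
            group = new_group
            index[rec["key"]] = group  # later writes see it as an update
        tagged.append((rec["key"], group))
    return tagged

index = {"a": "fg-1", "b": "fg-1"}
tags = tag_records(index, [{"key": "a"}, {"key": "c"}], new_group="fg-2")
```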

Record Level Index

  • Record-level index with a KV state backend
  • Exactly-once semantics with checkpoints
  • Automatic table services: cleaning/compaction

Materialization

  • Preserve all the change operations
  • Preserve the real event-time sequence for each key
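Preserving the event sequence for a key amounts to resolving out-of-order arrivals by an ordering field: per key, keep the record with the greatest event time (akin to Hudi's precombine field). A toy sketch:

```python
# Illustrative sketch: resolve out-of-order arrivals per key by event
# time, so a late record never overwrites a newer one.

def merge_by_event_time(records):
    latest = {}
    for rec in records:
        key, ts = rec["key"], rec["ts"]
        if key not in latest or ts >= latest[key]["ts"]:
            latest[key] = rec
    return latest

out_of_order = [
    {"key": "a", "ts": 2, "val": "new"},
    {"key": "a", "ts": 1, "val": "late"},  # late arrival must not win
    {"key": "b", "ts": 1, "val": "only"},
]
merged = merge_by_event_time(out_of_order)
```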

Incremental ETL

  • Monitor the incremental dataset for new instants
  • The timeline and filesystem view guard the visibility of records
  • Supports a specific start offset
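Incremental consumption boils down to remembering the last instant read from the timeline and fetching only records committed after it; a specific start offset is just a chosen starting instant. A toy sketch:

```python
# Illustrative sketch: incremental read over a commit timeline. The
# reader keeps the last instant it consumed and fetches only records
# committed after it on each poll.

def incremental_read(timeline, since_instant):
    """timeline: list of (instant, records) pairs, ordered by instant."""
    new_records, last = [], since_instant
    for instant, records in timeline:
        if instant > since_instant:
            new_records.extend(records)
            last = instant
    return new_records, last

timeline = [("001", ["r1", "r2"]), ("002", ["r3"]), ("003", ["r4"])]
batch, checkpoint = incremental_read(timeline, since_instant="001")
```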

Incremental ETL

  • Medallion architecture in end-to-end streaming style
  • The Hudi table is the carrier that persists the dynamic table and propagates changes downstream

Hudi as a Lakehouse Platform

Lake Storage (cloud object stores, HDFS, …)

Open File/Data Formats (Parquet, HFile, Avro, ORC, …)

Transactional Database Layer:

  • Table Format (schema, file listings, stats, evolution, …)
  • Indexes (bloom filter, HBase, bucket index, hash based, Lucene, …)
  • Concurrency Control (OCC, MVCC, non-blocking, lock providers, orchestration, scheduling, …)
  • Table Services (cleaning, compaction, clustering, indexing, file sizing, …)
  • Lake Cache (columnar, transactional, mutable, WIP, …)
  • Metaserver (stats, table service coordination, …)

Execution/Runtimes:

  • SQL Query Engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, …)

Platform Services (streaming/batch ingest, various sources, catalog sync, admin CLI, data quality, …)

Hudi Table Format

File Group Structure for a MOR table
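In a Merge-On-Read file group, a columnar base file is paired with log files that accumulate later updates and deletes; a snapshot reader (or the compactor) merges them by record key. A toy sketch of that merge:

```python
# Illustrative sketch: a MOR file group = one base file plus log files
# of later changes. A snapshot read merges them by record key to
# produce the latest view.

def snapshot_read(base_file, log_files):
    merged = {rec["key"]: rec for rec in base_file}
    for log in log_files:              # logs applied in arrival order
        for rec in log:
            if rec.get("deleted"):
                merged.pop(rec["key"], None)
            else:
                merged[rec["key"]] = rec
    return [merged[k] for k in sorted(merged)]

base = [{"key": "a", "val": 1}, {"key": "b", "val": 2}]
logs = [[{"key": "a", "val": 10}], [{"key": "b", "deleted": True}]]
snapshot = snapshot_read(base, logs)
```

Writes stay cheap (append to a log); compaction later folds the logs into a new base file so reads stay fast.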

Multi-modal indexing sub-system

Scalable metadata table

  • Internal MoR table
  • Different partitions store different stats and indexes

Many types of indexes

  • Files, column stats, bloom filters, record-level index, secondary indexes, etc.

Async Indexer

  • Concurrently build index partitions
  • Zero-downtime operation
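Conceptually, each metadata-table partition serves a different lookup; the sketch below shows column stats being used to prune files for a query (partition names and layout are made up for illustration, not Hudi's actual schema):

```python
# Illustrative sketch: a multi-modal index kept as partitions of one
# internal metadata table, each serving a different lookup. Partition
# names and layout here are invented for illustration.

metadata_table = {
    "files":        {"2023/01/01": ["fg1-base.parquet", "fg1.log"]},
    "column_stats": {("fg1-base.parquet", "price"): {"min": 1, "max": 99}},
    "record_index": {"key-42": "fg1"},
}

def prune_files(metadata_table, column, predicate):
    """Keep only files whose column stats can satisfy the predicate."""
    return [f for (f, col), stats in metadata_table["column_stats"].items()
            if col == column and predicate(stats)]

hits = prune_files(metadata_table, "price", lambda s: s["max"] >= 50)
```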

Hudi Tech Stack on Cloud

Roadmap

  • First-class support for CDC data
    • Incremental queries
  • Record-level index
    • Global index
    • Performs better for random updates
  • New table + merge APIs
    • Easier reader/writer integrations
    • Engine-specific merge implementations
  • First-class support for unstructured data
    • JSON blobs, images, videos
    • Same capabilities around indexing and capturing changes
  • https://github.com/apache/hudi/pull/8679

Resources

The Community

  • 3000+ Slack members
  • 300+ contributors
  • 3000+ GH engagers
  • 30+ committers
  • Pre-installed on 5 cloud providers
  • Diverse PMC/committers
  • 1M downloads/month (400% YoY)
  • 800B+ records/day (from even just one customer!)

Rich community of participants

Engage With Our Community

Join Hudi Slack

Thanks!

Questions?