1 of 21

Enabling Data LakeHouse: Using Apache Hudi

#The Big Data show

Ankur Ranjan

Data Engineer III at Walmart

2 of 21

Here for you today

Ankur Ranjan

  • Data Engineer III at Walmart
  • Manage my YouTube channel - The Big Data Show
  • You can find me on LinkedIn, where I write posts and articles about my learning and experimentation with Data Engineering

3 of 21

Table of Contents

  1. Data Warehouse vs Data Lake
  2. Evolution of Data LakeHouse
  3. Enable LakeHouse using an open table format
  4. Optimization benefits of enabling a LakeHouse - use cases in real industry
  5. Conclusion

4 of 21

Evolution of Data Storage Methodologies over Time

First Generation

Second Generation

Third Generation

5 of 21

Data Warehouse vs Data Lake

A data warehouse is a centralized repository that stores structured data from various sources for the purpose of BI and analytics.

Pros:

  • Centralized Data
  • Structured Data
  • Performance
  • Data Governance
  • Easy to enforce Data Quality

Cons:

  • Limited Support for Unstructured Data
  • High Initial Cost
  • Schema Rigidity

A data lake is a storage repository for large amounts of raw data in diverse formats, supporting advanced analytics and exploration without requiring upfront data structuring, in contrast to traditional warehouses.

Pros:

  • Scalability & Flexibility
  • Cost-Effective Storage & Advanced Analytics
  • Better handling of semi-structured and unstructured data

Cons:

  • Data Quality & Governance Challenges
  • Hard to maintain ACID
  • Security and Privacy Concerns

6 of 21

Evolution of Data LakeHouse

Data Warehouse: "I have great data management features."

Data Lake: "I am scalable and agile."

Let's merge them: Data Lakehouse

  • Time Travel and Versioning
  • Better support for UPSERT and DELETE
  • Better Data Governance support

7 of 21

Evolution of Data Storage Methodologies over Time

First Generation

Second Generation

Third Generation

8 of 21

A Data LakeHouse is a modern data architecture that combines the strengths of both data lakes and data warehouses. It seeks to address the limitations and drawbacks of each approach while providing a unified platform for storing, managing, and analyzing data. Let's try to understand how an open table format can help to build a Data LakeHouse.

Data LakeHouse

A new Approach

9 of 21

How to enable a LakeHouse using an open table format like Apache Hudi

  • Data Lake Storage: S3, GCS, HDFS
  • File Format: Parquet, ORC, Avro, CSV
  • Open Table Format: Apache Hudi, Iceberg, Delta Lake
  • Compute / Query Engine
  • Interfaces: JDBC/ODBC
  • Users and Applications
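To make the layered picture concrete, here is a minimal sketch of a query engine reading a Hudi table that lives on lake storage. It assumes PySpark as the compute engine; the bucket path, table name, column names, and the Hudi bundle version are purely illustrative.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("lakehouse-read")
        # The Hudi Spark bundle must be on the classpath; this version is an assumption.
        .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate()
    )

    # Storage layer: S3. File format: Parquet. Table format: Hudi. Engine: Spark.
    orders = spark.read.format("hudi").load("s3a://my-lake/bronze/orders")
    orders.createOrReplaceTempView("orders")
    spark.sql("SELECT order_id, status FROM orders WHERE status = 'SHIPPED'").show()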

10 of 21

Apache Hudi

Open Table Format

Apache Hudi (Hadoop Upserts, Deletes, and Incrementals) is an open-source data management framework designed to simplify incremental data processing and data management in large-scale big data environments.

11 of 21

Apache Hudi

Open Table Format

Record Key: primary key + partition path (e.g., ID + createdDate); uniquely identifies a record in the table.

Precombine Key: a field such as an updated timestamp, used to pick the winning record when multiple incoming records share the same record key.

Index: mapping between a record key and the file group / file id that contains it.

Timeline: event sequence of all actions performed on the table at different instants.

Data Files: the actual data files, typically stored in Parquet format.
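A minimal write sketch tying these concepts to Hudi's Spark write options. Field names such as ID, createdDate, and updatedAt, the table path, and the orders_df DataFrame are assumptions for illustration.

    # orders_df is a hypothetical Spark DataFrame holding the incoming records.
    hudi_options = {
        "hoodie.table.name": "orders",
        "hoodie.datasource.write.recordkey.field": "ID",               # record key
        "hoodie.datasource.write.partitionpath.field": "createdDate",  # partition path
        "hoodie.datasource.write.precombine.field": "updatedAt",       # precombine key
        "hoodie.datasource.write.operation": "upsert",
    }

    (
        orders_df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")   # "append" lets Hudi upsert into the existing table
        .save("s3a://my-lake/bronze/orders")
    )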

12 of 21

Benefits of building Data LakeHouse using Apache Hudi

Data mutation, Better support for Row level Upsert or Merge

Schema enforcement, evolution & versioning

Better Transactions (ACID) support

Historical Data and Versioning: Time Travel

Partial Update Support

Delete Support: Hard and soft delete

Different Index and clustering support

Handle duplicates efficiently using Primary Key and DeDup key
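Two of these benefits, sketched against the same hypothetical table and hudi_options as in the earlier write example; the commit instant and deletes_df are illustrative assumptions.

    # Time travel: read the table as it looked at an earlier commit instant.
    old_snapshot = (
        spark.read.format("hudi")
        .option("as.of.instant", "2024-01-01 00:00:00.000")
        .load("s3a://my-lake/bronze/orders")
    )

    # Hard delete: records in deletes_df (a hypothetical DataFrame carrying the
    # record keys to remove) are deleted from the table.
    (
        deletes_df.write.format("hudi")
        .options(**hudi_options)
        .option("hoodie.datasource.write.operation", "delete")
        .mode("append")
        .save("s3a://my-lake/bronze/orders")
    )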

13 of 21

Data mutation: the old, classic Data Lake approach

14 of 21

Data mutation, Better support for Row level Upsert or Merge

15 of 21

Data Lake: Classic Approach - Read in all data, merge and overwrite

😥

16 of 21

LakeHouse Approach - Read only the required data and modify only the required files

🙋
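A minimal sketch contrasting the two approaches. updates_df is a hypothetical DataFrame of changed rows; the paths, the order_id key, and hudi_options (from the earlier write sketch) are assumptions.

    # Classic data lake: read the whole table, merge in memory, rewrite everything.
    full = spark.read.parquet("s3a://my-lake/raw/orders")
    merged = (
        full.join(updates_df, "order_id", "left_anti")   # keep untouched rows
        .unionByName(updates_df)                          # add the new/changed rows
    )
    merged.write.mode("overwrite").parquet("s3a://my-lake/raw/orders")

    # LakeHouse with Hudi: write only the changed rows; Hudi's index finds the
    # affected file groups and rewrites just those files.
    (
        updates_df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3a://my-lake/bronze/orders")
    )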

17 of 21

Better Partial Update support in LakeHouse architecture

Apache Hudi supports partial updates, where only a subset of fields in an existing record is updated. This is useful when:

  1. Only certain fields in a record have changed, and you don't want to overwrite the entire record
  2. Handling late-arriving data, where some fields may be missing

Hudi enables partial updates through its HoodieRecordPayload interface.

18 of 21

Better Partial Update support in LakeHouse architecture

  • Implementing partial updates in a plain Data Lake is a tedious process; most solutions are slow and time-consuming.
  • Teams often add a NoSQL database like MongoDB just to support partial updates in big data pipelines, and managed MongoDB-API services such as Azure Cosmos DB come with significant cost.
  • A LakeHouse supports partial updates efficiently and at a much lower cost.
  • In Apache Hudi you only need to set one configuration to enable partial updates (see the sketch below):
  • 'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.PartialUpdateAvroPayload'
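A minimal sketch, reusing the hudi_options and the hypothetical updates_df from the earlier sketches, with the payload class from the bullet above added.

    # PartialUpdateAvroPayload merges the incoming non-null fields into the
    # stored record instead of replacing the whole row.
    partial_update_options = {
        **hudi_options,
        "hoodie.datasource.write.payload.class":
            "org.apache.hudi.common.model.PartialUpdateAvroPayload",
    }

    # updates_df carries the record key, the precombine field, and the changed
    # columns; fields left null retain their existing values after the upsert.
    (
        updates_df.write.format("hudi")
        .options(**partial_update_options)
        .mode("append")
        .save("s3a://my-lake/bronze/orders")
    )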

19 of 21

Benefits of building Data LakeHouse using Apache Hudi

Data mutation, Better support for Row level Upsert or Merge

Schema enforcement, evolution & versioning

Better Transactions (ACID) support

Historical Data and Versioning: Time Travel

Partial Update Support

Delete Support: Hard and soft delete

Different Index and clustering support

Handle duplicates efficiently using Primary Key and DeDup key

20 of 21

Reference

21 of 21

Q&A

#The Big Data show

Ankur Ranjan

Data Engineer III at Walmart