Enabling Data LakeHouse: Using Apache Hudi
#The Big Data show
Ankur Ranjan
Data Engineer III at Walmart
Here for you today
Ankur Ranjan
Table of Contents
1. Data Warehouse vs Data Lake
2. Evolution of Data LakeHouse
3. Enable LakeHouse using open table format
4. Optimization benefits of enabling LakeHouse
5. Conclusion
Evolution of Data Storage Methodologies over Time
First Generation
Second Generation
Third Generation
Data Warehouse vs Data Lake
A data warehouse is a centralized repository that stores structured data from various sources for the purpose of BI and analytics.
Pros:
Cons:
A data lake is a storage repository for large amounts of raw data in diverse formats, supporting advanced analytics and exploration without requiring upfront data structuring, in contrast to traditional warehouses.
Pros:
Cons:
Evolution of Data LakeHouse
I have great data management features
I am scalable and agile
Data Warehouse
Data Lake
Data Lakehouse
Let’s Merge it
Evolution of Data Storage Methodologies over Time
First Generation
Second Generation
Third Generation
A Data LakeHouse is a modern data architecture that combines the strengths of both data lakes and data warehouses. It seeks to address the limitations and drawbacks of each approach while providing a unified platform for storing, managing, and analyzing data.

Let's try to understand how an open table format can help build a Data LakeHouse.
Data LakeHouse
A new Approach
How to enable LakeHouse using Open table format like Apache Hudi
Data Lake Storage: S3, GCS, HDFS
File Format: Parquet, ORC, Avro, CSV
Open Table Format: Apache Hudi, Iceberg, Delta Lake
Compute or Query Engine
Interfaces
JDBC/ODBC
Users and Application
Apache Hudi
Open Table Format
Apache Hudi (Hadoop Upserts, Deletes, and Incrementals) is an open-source data management framework designed to simplify incremental data processing and data management in large-scale big data environments.
Apache Hudi
Open Table Format
Record Key: Primary Key + Partition Path [ID + createdDate]
Precombine Key: field (e.g. updatedDate) used to pick the latest record when multiple records share the same record key
Index: mapping between record key and file group/file id
Timeline: Event sequence of all actions performed on the table at different instants.
Data Files: Actual data files, stored in Parquet format.
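The concepts above map directly onto Hudi's writer configuration. A minimal sketch in Python (the option keys are standard Hudi datasource options; the table name `orders` and the field names `ID`, `createdDate`, and `updatedDate` are illustrative placeholders):

```python
# Illustrative Hudi writer options mapping the concepts above to config keys.
# Table and field names (orders, ID, createdDate, updatedDate) are examples.
hudi_options = {
    "hoodie.table.name": "orders",                                 # example table name
    "hoodie.datasource.write.recordkey.field": "ID",               # record key (primary key)
    "hoodie.datasource.write.partitionpath.field": "createdDate",  # partition path
    "hoodie.datasource.write.precombine.field": "updatedDate",     # precombine key
    "hoodie.datasource.write.operation": "upsert",                 # row-level upsert
}

# With PySpark, these options would typically be applied like:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```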
Benefits of building Data LakeHouse using Apache Hudi
Data mutation, Better support for Row level Upsert or Merge
Schema enforcement, evolution & versioning
Better Transactions (ACID) support
Historical Data and Versioning: Time Travel
Partial Update Support
Delete Support: Hard and soft delete
Different Index and clustering support
Handle duplicates efficiently using Primary Key and DeDup key
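As a plain-Python sketch (no Spark or Hudi required), the deduplication in the last benefit works roughly like this: among records sharing a record key, only the one with the highest precombine value survives.

```python
# Toy model of Hudi-style deduplication: for records sharing a record key,
# keep only the one with the largest precombine (e.g. updatedDate) value.
def dedup(records, record_key, precombine_key):
    latest = {}
    for rec in records:
        key = rec[record_key]
        if key not in latest or rec[precombine_key] > latest[key][precombine_key]:
            latest[key] = rec
    return list(latest.values())

batch = [
    {"ID": 1, "amount": 10, "updatedDate": "2024-01-01"},
    {"ID": 1, "amount": 15, "updatedDate": "2024-01-03"},  # newer duplicate wins
    {"ID": 2, "amount": 99, "updatedDate": "2024-01-02"},
]
print(dedup(batch, "ID", "updatedDate"))
```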
Data mutation, Old Classic DataLake approach
Data mutation, Better support for Row level Upsert or Merge
Data Lake: Classic Approach - Read in all data, merge and overwrite
😥
LakeHouse Approach - Read only the required files and modify only the required records
🙋
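The contrast on this slide can be sketched in plain Python (a toy model of the two strategies, not real Hudi): the classic approach reads and rewrites the entire dataset, while the lakehouse approach rewrites only the file groups whose record keys match the incoming updates.

```python
# Toy model of the two mutation strategies (illustrative, not real Hudi).
# Each "file" is a dict of record-key -> row value.
def classic_overwrite(all_files, updates):
    # Read EVERYTHING, merge in memory, write the whole dataset back out.
    merged = {}
    for f in all_files:
        merged.update(f)
    merged.update(updates)
    return [merged]  # entire dataset rewritten

def lakehouse_upsert(all_files, updates):
    # Rewrite only the files whose keys intersect the updates.
    out, rewritten = [], 0
    for f in all_files:
        if f.keys() & updates.keys():
            out.append({**f, **{k: v for k, v in updates.items() if k in f}})
            rewritten += 1
        else:
            out.append(f)  # untouched file group is left as-is
    return out, rewritten

files = [{1: "a", 2: "b"}, {3: "c", 4: "d"}]
updated, touched = lakehouse_upsert(files, {1: "A"})
print(touched)  # only one of the two file groups was rewritten
```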
Better Partial Update support in LakeHouse architecture
Apache Hudi supports partial updates, where only a subset of fields in an existing record is updated. This is useful when only a few columns of a wide record change between writes, avoiding a full-row rewrite.
Better Partial Update support in LakeHouse architecture
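A plain-Python sketch of the partial-update idea (Hudi implements this internally via payload/merge classes; the helper below is purely illustrative): only the fields present in the incoming change are applied, and every other field of the stored record is preserved.

```python
# Illustrative partial update: apply only the fields present in the patch,
# leaving all other fields of the stored record untouched.
def partial_update(stored, patch):
    return {**stored, **{k: v for k, v in patch.items() if v is not None}}

stored = {"ID": 7, "name": "Ankur", "city": "Bengaluru", "updatedDate": "2024-01-01"}
patch = {"ID": 7, "city": "Delhi", "updatedDate": "2024-02-01"}  # only two fields change
print(partial_update(stored, patch))
```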
Benefits of building Data LakeHouse using Apache Hudi
Data mutation, Better support for Row level Upsert or Merge
Schema enforcement, evolution & versioning
Better Transactions (ACID) support
Historical Data and Versioning: Time Travel
Partial Update Support
Delete Support: Hard and soft delete
Different Index and clustering support
Handle duplicates efficiently using Primary Key and DeDup key
Q&A