CDC Streams by the Lakehouse on Apache Hudi
Sagar Sumit
Apache Hudi/Onehouse
Speaker Bio
Sagar Sumit
Agenda
Lakehouse Architecture
Origins@Uber 2016
Context
Problems
Hudi Data Lake
Need for Lakehouse
Reliable data management on data lake
Support for ML/DS on top of lake
Performance
Challenges with Streaming on Data Lake
Changelog Stream Ingestion
Materialization
Incremental ETL
Solutions with Flink and Kafka on Hudi
Record Level Index
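As a rough illustration of what a record level index does, the toy sketch below maps each record key to the file group that holds it, so an incoming batch can be split into updates (key already located) and inserts (key unseen) without scanning data files. `ToyRecordIndex`, `tag`, `record_write`, and the `(partition, file group id)` location tuple are hypothetical names for this sketch and are not Hudi APIs; the real index lives in Hudi's metadata table.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical location type for this sketch: (partition path, file group id).
Location = Tuple[str, str]


class ToyRecordIndex:
    """In-memory stand-in for a key -> file group lookup (not a Hudi class)."""

    def __init__(self) -> None:
        self._locations: Dict[str, Location] = {}

    def tag(self, keys: Iterable[str]) -> Tuple[Dict[str, Location], List[str]]:
        """Split incoming keys into updates (known location) and inserts."""
        updates: Dict[str, Location] = {}
        inserts: List[str] = []
        for key in keys:
            if key in self._locations:
                updates[key] = self._locations[key]
            else:
                inserts.append(key)
        return updates, inserts

    def record_write(self, key: str, location: Location) -> None:
        """Remember where a key was written so later batches can find it."""
        self._locations[key] = location


index = ToyRecordIndex()
index.record_write("user-1", ("2022/10/01", "fg-0001"))
updates, inserts = index.tag(["user-1", "user-2"])
```

Each lookup here is a single dictionary access per key, which is the property that makes a key-level index attractive for changelog streams with many updates.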
Materialization
Incremental ETL
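The idea behind incremental ETL can be sketched in a few lines: each record carries the instant of the commit that last wrote it, and a downstream job pulls only records committed after its last checkpoint. The sketch below is a toy model under that assumption. `Row`, `incremental_read`, and the sample instants are hypothetical, and in Hudi itself this would be an incremental query against the table rather than a list filter.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Row:
    key: str
    amount: int
    commit_time: str  # timestamp-style instant; sorts correctly as a string


def incremental_read(table: List[Row], begin_instant: str) -> Tuple[List[Row], str]:
    """Return rows committed strictly after begin_instant and the new checkpoint."""
    changed = [r for r in table if r.commit_time > begin_instant]
    # If nothing changed, the checkpoint stays where it was.
    new_checkpoint = max((r.commit_time for r in changed), default=begin_instant)
    return changed, new_checkpoint


table = [
    Row("a", 10, "20221001100000"),
    Row("b", 20, "20221001110000"),
    Row("c", 30, "20221001120000"),
]

# The downstream job last processed the 10:00 commit, so it sees only b and c.
changed, checkpoint = incremental_read(table, "20221001100000")
```

Feeding the returned checkpoint into the next run means each downstream job processes every change once instead of rescanning the whole table.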
Hudi as a Lakehouse Platform
Lake Storage (Cloud Object Stores, HDFS, …)
Open File/Data Formats (Parquet, HFile, Avro, ORC, …)
Concurrency Control (OCC, MVCC, Non-blocking, Lock providers, Orchestration, Scheduling, …)
Table Services (cleaning, compaction, clustering, indexing, file sizing, …)
Indexes (Bloom filter, HBase, Bucket index, Hash based, Lucene, …)
Table Format (Schema, File listings, Stats, Evolution, …)
Lake Cache (Columnar, transactional, mutable, WIP, …)
Metaserver (Stats, table service coordination, …)
SQL Query Engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, …)
Platform Services (Streaming/Batch ingest, various sources, Catalog sync, Admin CLI, Data Quality, …)
Transactional Database Layer
Execution/Runtimes
Hudi Table Format
File Group Structure for a MOR table
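In a merge-on-read (MOR) table, a file group holds file slices, each made of a columnar base file plus row-based log files with later changes. Readers merge the two at query time, and compaction folds the logs into a new base file. The sketch below models only that merge logic in memory. `FileSlice`, `LogRecord`, `merge_on_read`, and `compact` are hypothetical names, and the last-writer-wins rule is a simplifying assumption, not Hudi's actual merge implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LogRecord:
    key: str
    value: Optional[dict]  # None marks a delete in this toy model


@dataclass
class FileSlice:
    base: Dict[str, dict]  # stands in for the columnar base file
    log_blocks: List[List[LogRecord]] = field(default_factory=list)


def merge_on_read(file_slice: FileSlice) -> Dict[str, dict]:
    """Apply log blocks to the base records in order; later writes win."""
    merged = dict(file_slice.base)
    for block in file_slice.log_blocks:
        for rec in block:
            if rec.value is None:
                merged.pop(rec.key, None)
            else:
                merged[rec.key] = rec.value
    return merged


def compact(file_slice: FileSlice) -> FileSlice:
    """Fold all log blocks into a new base, leaving an empty log."""
    return FileSlice(base=merge_on_read(file_slice))


file_slice = FileSlice(
    base={"a": {"v": 1}, "b": {"v": 2}},
    log_blocks=[
        [LogRecord("a", {"v": 10})],                          # update a
        [LogRecord("b", None), LogRecord("c", {"v": 3})],     # delete b, insert c
    ],
)
snapshot = merge_on_read(file_slice)
compacted = compact(file_slice)
```

Writes stay cheap because they only append log blocks. The merge cost is paid by readers until compaction rewrites the base.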
Multi-modal indexing sub-system
Scalable metadata table
Many types of indexes
Async Indexer
Hudi Tech Stack on Cloud
Roadmap
Resources
https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101
https://www.oreilly.com/radar/the-world-beyond-batch-streaming-102
https://www.onehouse.ai/blog/intro-to-hudi-and-flink
https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md
https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform
https://hudi.apache.org/docs/overview#core-concepts-to-learn
The Community
3000+
Slack Members
300+
Contributors
3000+
GH Engagers
30+
Committers
Pre-installed on 5 cloud providers
Diverse PMC/Committers
1M DLs/month (400% YoY)
800B+
Records/Day
(from even just 1 customer!)
Rich community of participants
Engage With Our Community
Docs : https://hudi.apache.org
Blogs : https://hudi.apache.org/blog
Slack : https://join.slack.com/t/apache-hudi/shared_invite/zt-1e94d3xro-JvlNO1kSeIHJBTVfLPlI5w
Twitter : https://twitter.com/apachehudi
GitHub : https://github.com/apache/hudi/ (Give us a star ⭐!)
Mailing list(s) : dev-subscribe@hudi.apache.org (send an empty email to subscribe)
Join Hudi Slack
Thanks!
Questions?