Intro to DataFusion: Technology, Community, and Not Quite Enough Time
| © Copyright 2024, InfluxData
Outline
Thank you to our Sponsors
Room and facilities
(thanks Justin & Gladys!)
Thank you to our Sponsors
Food
Talks
6:00 - 6:20: Intro to DataFusion: Technology, Community, and Not Quite Enough Time
Andrew Lamb is a Staff Engineer at InfluxData, working on InfluxDB 3.0, focused on query processing, the Apache DataFusion query engine and the Apache Arrow ecosystem. He is a member of the Apache Software Foundation and the Apache DataFusion PMC (2024 Chair), and Apache Arrow PMC (2023 Chair) (LinkedIn).
6:20 - 6:40: Embucket: a Snowflake-compatible lakehouse based on DataFusion
Camuel Gilyadov is Founder and CEO of Embucket (LinkedIn)
6:40 - 7:00: Stateless engines, table formats, and the future
Jake Thomas is Manager, Data Foundations at Okta (LinkedIn)
6:40 - 7:00: Building InfluxDB 3.0 with DataFusion
Andrew Lamb
Update: Jake is unable to make it tonight 🤧
DataFusion: Technology
Top Level Project, Apache Software Foundation
Apache 2.0 Licensed
Analogy: DataFusion is LLVM for Databases
Clang: C/C++ frontend + LLVM
Rust: Rustlang frontend + LLVM
Swift: Swift frontend + LLVM
…
LLVM enabled innovation in programming languages:
Analogy: DataFusion is LLVM for Databases
Analytic Application
Domain Specific Language
Specialized Database
Application Logic
Catalog
Analysis Engine
Multiple SQL Dialects
Data Flow Analysis
Custom Operators
File System Interface
…
DataFusion enables innovation in data intensive systems
Recognized Tech
Top of the Line Performance
Speed (and underlying techniques) similar to other top engines such as ClickHouse and DuckDB
Trending
Who’s who of DataFusion users
Architecture
Design Goals:
Results for Users
Use Case: File Formats (Lance)
Courtesy of Weston Pace
Decoder uses DataFusion Expr simplification to calculate zone pruning (source)
Encoder uses DataFusion aggregators to calculate min/max (source)
Lance file format uses DataFusion to implement pushdown filtering
[Diagram: Arrow, Encoder, Lance Format, Decoder, Arrow]
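The min/max and zone pruning idea above can be sketched in a few lines. This is a minimal illustration using only the Rust standard library, not Lance's or DataFusion's actual code: `ZoneStats`, `build_zone_stats`, and `zones_to_read` are hypothetical names, and the column is assumed to hold plain `i64` values.

```rust
/// Min/max statistics for one zone (a contiguous run of rows) of an i64 column.
/// Hypothetical type for illustration; not part of Lance or DataFusion.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ZoneStats {
    pub min: i64,
    pub max: i64,
}

/// Encoder side: compute min/max for each zone of `zone_size` rows.
pub fn build_zone_stats(values: &[i64], zone_size: usize) -> Vec<ZoneStats> {
    assert!(zone_size > 0, "zone_size must be positive");
    values
        .chunks(zone_size)
        .map(|zone| ZoneStats {
            // `chunks` never yields an empty slice, so min/max always exist.
            min: *zone.iter().min().expect("non-empty zone"),
            max: *zone.iter().max().expect("non-empty zone"),
        })
        .collect()
}

/// Decoder side: indices of zones that may contain a value in `lo..=hi`.
/// Zones whose [min, max] range cannot overlap the filter are skipped.
pub fn zones_to_read(stats: &[ZoneStats], lo: i64, hi: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, z)| z.max >= lo && z.min <= hi)
        .map(|(i, _)| i)
        .collect()
}
```

With three zones covering 1..5, 10..12, and 20..25, a filter for values between 13 and 19 reads no zones at all; that skipped I/O is the point of keeping the statistics.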
Use Case: Table Formats
delta-rs uses DataFusion for various features (and provides a TableProvider for reading)
delta-rs / deltalake python package
write
Predicate Evaluation (in overwrite mode)
Optimize (compact)
Load
Delete
…
Z-Order evaluation
TableProvider:
Projection pushdown,
Limit pushdown
Predicate pushdown
Predicate Evaluation
Load_CDF
Custom plan nodes for change data feeds
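To make the three pushdowns listed above concrete, here is a toy scan written with only the Rust standard library. It is a sketch under stated assumptions, not the signature of DataFusion's `TableProvider` trait or delta-rs code: `Table`, `column`, and `scan` are hypothetical, all columns are assumed to be `i64` and of equal length, and the predicate is limited to a single column.

```rust
/// A hypothetical in-memory table: named i64 columns, all the same length.
pub struct Table {
    pub columns: Vec<(String, Vec<i64>)>,
}

impl Table {
    /// Look up a column by name.
    fn column(&self, name: &str) -> Option<&Vec<i64>> {
        self.columns
            .iter()
            .find(|(n, _)| n.as_str() == name)
            .map(|(_, v)| v)
    }

    /// Scan with pushdown applied at the source:
    /// - projection: only the requested columns are materialized,
    /// - predicate: rows failing `predicate` on `filter_col` are skipped,
    /// - limit: iteration stops once `limit` matching rows are produced.
    /// Returns `None` if any named column does not exist.
    pub fn scan(
        &self,
        projection: &[&str],
        filter_col: &str,
        predicate: impl Fn(i64) -> bool,
        limit: Option<usize>,
    ) -> Option<Vec<Vec<i64>>> {
        let filter_values = self.column(filter_col)?;
        let projected: Vec<&Vec<i64>> = projection
            .iter()
            .map(|name| self.column(name))
            .collect::<Option<Vec<_>>>()?;
        let rows: Vec<Vec<i64>> = filter_values
            .iter()
            .enumerate()
            .filter(|(_, v)| predicate(**v))
            .take(limit.unwrap_or(usize::MAX))
            .map(|(i, _)| projected.iter().map(|col| col[i]).collect())
            .collect();
        Some(rows)
    }
}
```

Because the iterator chain is lazy, `take` ends the scan as soon as enough rows have matched, and columns outside the projection are never touched.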
Use Case: SQL Analysis (frontend)
SDF
Assembly
IntelliSense
Tests
Reports
Business Value
Catalog
BigQuery
Redshift
Metadata
Unified Logical Plan
SDF
Static Analysis
SDF Development Framework
Ingest
Deploy
Analyze
Guarantee
Snowflake
Trino/Presto
SDF uses complete ANTLR Grammars to define many SQL dialects - notably, proprietary ones like Snowflake.
All SQL dialects compile to a unified Intermediate Representation: the DataFusion Logical Plan.
This gives SDF Executable Semantics.
SDF’s transformation layer statically analyzes many logical plans at once for correctness and generates rich metadata.
Courtesy of SDF / Lukas Shute
Use Case: Execution Engine
Integration Layer with Spark
DataFusion’s ExecutionPlan Streams
Use Spark Planner / Executor machinery
Use Case: Specialized Database Systems
Examples
Domain Specific Language
Catalog
Custom Operators
See more: Apache DataFusion documentation
DataFusion: Community + Not Quite Enough Time
“To achieve great things, two things are needed; a plan, and not quite enough time.”
- Leonard Bernstein (according to the internet)
Who Controls Project / Roadmap
🤑💰
Community
Not started / donated by a company: founded by Andy Grove
Community:
Velocity:
* Caveat: some distortion due to tortured git history
“Community over Code” - The Apache Way
Non profit governance of open source communities
Apache: Benefits for DataFusion
My Personal / Professional Goal
1,000+ projects!
(Used to be a crazy number I just made up. Not so crazy anymore…)
Thank You: On with the Talks
THANK YOU
Backup
DataFusion / Query Engine: Input / Output
Data Batches
SQL Query
SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
Data Batches
DataFrame
ctx.read_table("http")?
.filter(...)?
.aggregate(..)?;
Catalog information:
tables, schemas, etc
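As a rough illustration of what the SQL query above computes as data batches flow through, here is a hand-written equivalent using only the Rust standard library. It is a sketch, not DataFusion code or its API: the `Request` row type and `count_by_status` are hypothetical, and only the two columns the query touches are modeled.

```rust
use std::collections::BTreeMap;

/// One row of a hypothetical `http_api_requests_total` table,
/// reduced to the two columns the query uses.
pub struct Request {
    pub path: &'static str,
    pub status: u16,
}

/// Hand-written equivalent of:
///   SELECT status, COUNT(1) FROM http_api_requests_total
///   WHERE path = '/api/v2/write' GROUP BY status;
/// The input arrives as batches; the filter and the grouped count are
/// applied batch by batch, so the whole table is never held at once.
pub fn count_by_status(batches: &[Vec<Request>], path: &str) -> BTreeMap<u16, u64> {
    let mut counts = BTreeMap::new();
    for batch in batches {
        // WHERE path = ...
        for row in batch.iter().filter(|r| r.path == path) {
            // GROUP BY status, COUNT(1)
            *counts.entry(row.status).or_insert(0) += 1;
        }
    }
    counts
}
```

A query engine generates this kind of pipeline from the SQL text or the DataFrame calls, so the application does not have to write it by hand for each query.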
Architecture
SQL
Front Ends
DataFrame
LogicalPlan
ExecutionPlan
Plan Representations and Rewrites
Expression Eval
Optimizations / Transformations
Optimizations / Transformations
HashAggregate
Sort
…
Execution Engine
Join
Catalog and
Data Sources
Parquet
CSV
…
Extension
Catalog / Table
Extension
Frontend
Extension
LogicalPlan Rewrite
Extension ExecutionPlan Rewrite
Extension
Stream
Extension Node
Extension Node
Streams
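The "Plan Representations and Rewrites" layer in the architecture above can be pictured as tree-to-tree transformations. The sketch below shows one such bottom-up rewrite on a toy expression tree, using only the Rust standard library. The `Expr` enum and `simplify` function here are hypothetical stand-ins and far smaller than DataFusion's own types; integer overflow handling is omitted.

```rust
/// A toy expression tree (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// Bottom-up rewrite: simplify children first, then fold constants
/// and drop identity operations (x + 0, x * 1) at the current node.
/// Overflow of folded constants is not handled in this sketch.
pub fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::Add(l, r) => match (simplify(*l), simplify(*r)) {
            (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a + b),
            (Expr::Literal(0), e) | (e, Expr::Literal(0)) => e,
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        Expr::Mul(l, r) => match (simplify(*l), simplify(*r)) {
            (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a * b),
            (Expr::Literal(1), e) | (e, Expr::Literal(1)) => e,
            (l, r) => Expr::Mul(Box::new(l), Box::new(r)),
        },
        // Columns and literals are already as simple as they can be.
        leaf => leaf,
    }
}
```

An extension rewrite, as drawn in the diagram, would be another function of the same shape run alongside the built-in passes.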