Intro to DataFusion: Technology, Community, and Not Quite Enough Time
x
June 9, 2025: Apache DataFusion Meetup 2.0, San Francisco
Andrew Lamb
Staff Engineer, InfluxData
Apache DataFusion PMC Chair
DataFusion: Technology
Top Level Project, Apache Software Foundation
Apache 2.0 Licensed
Analogy: DataFusion is LLVM for Databases
Clang
Rust
Swift
C/C++ frontend
LLVM
Rustlang frontend
LLVM
Swift frontend
LLVM
…
LLVM enabled innovation in programming languages:
Analogy: DataFusion is LLVM for Databases
Analytic Application
Domain Specific Language
Specialized Database
Application Logic
Catalog
Analysis Engine
Multiple SQL Dialects
Data Flow Analysis
Custom Operators
File System Interface
…
DataFusion enables innovation in data intensive systems
Architecture
SQL
Front Ends
DataFrame
LogicalPlan
ExecutionPlan
Plan Representations and Rewrites
Expression Eval
Optimizations / Transformations
Optimizations / Transformations
HashAggregate
Sort
…
Execution Engine
Join
Catalog and
Data Sources
Parquet
CSV
…
Extension
Catalog / Table
Extension
Frontend
Extension
LogicalPlan Rewrite
Extension ExecutionPlan Rewrite
Extension
Stream
Extension Node
Extension Node
Streams
Architecture
Design Goals:
Results for Users
Specialize to suit their needs and available engineering capacity
Easy to try out new ideas (operators, rewrites, etc)
Top of the Line Performance
Speed (and underlying techniques) similar to other top engines such as ClickHouse + DuckDB
Top of the Line Performance (is fleeting!)
ClickBench Results as of June 2, 2025
My Personal / Professional Goal
1,000+ projects!
(Used to be a crazy number I just made up.
Not so crazy anymore…)
Apache DataFusion Powered Products
Recognized Tech
Apache Arrow DataFusion:
A Fast, Embeddable, Modular Analytic Query Engine
Trending
Who’s who of DataFusion users
Use Case: Specialized Database Systems
Examples
Domain Specific Language
Catalog
Custom Operators
See more: Apache DataFusion documentation
Use Case: Execution Engine
Integration Layer with Spark
DataFusion’s ExecutionPlan Streams
Use Spark Planner / Executor machinery
Use Case: File Formats (Lance)
Encoder
Decoder
Courtesy of Weston Pace
Decoder uses DataFusion Expr simplification to calculate zone pruning (source)
Encoder uses DataFusion aggregators to calculate min/max (source)
Lance Format
Arrow
Lance file format uses DataFusion to implement pushdown filtering
Arrow
Use Case: Table Formats
delta-rs uses DataFusion for various features (and provides
a TableProvider for reading)
v
delta-rs / deltalake python package
Write
Predicate Evaluation (in overwrite mode)
Optimize (compact)
Load
Delete
…
Z-Order evaluation
TableProvider:
Projection pushdown,
Limit pushdown
Predicate pushdown
Predicate Evaluation
Load_CDF
Custom plan nodes
for change data
deltafeeds
Use Case: SQL Analysis (frontend)
SDF
Assembly
Intelisense
Tests
Reports
Business Value
Catalog
BigQuery
Redshift
Metadata
Unified Logical Plan
SDF
Static Analysis
SDF Development Framework
Ingest
Deploy
Analyze
Guarantee
Snowflake
Trino/Presto
SDF uses complete ANTLR Grammars to define many SQL dialects - notably, proprietary ones like Snowflake.
All SQLs compile to a unified Intermediate Representation: the Datafusion Logical Plan.
This gives SDF Executable Semantics.
SDF’s transformation layer statically analyzes many logical plans at once for correctness and generates rich metadata.
Courtesy of SDF / Lukas Shute
DataFusion: Community + Not Quite Enough Time
“Community over Code” - The Apache Way
Non profit governance of open source communities
Apache: Benefits for DataFusion
Community
Not started / donated by a company:
Community:
Velocity:
* Caveat: some distortion due to tortured git history
Project / Roadmap
💰
Join the community, hack on Databases
It is a pretty great way to get a chance to work on Database Internals
⇒ Always short on time, especially:
Thank You:
On with the Talks
Backup
DataFusion / Query Engine: Input / Output
Data Batches
SQL Query
SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
DataFrame
ctx.read_table("http")?
.filter(...)?
.aggregate(..)?;
Catalog information:
tables, schemas, etc
Data Batches