A leading Data & AI Consultancy in the Netherlands
From strategy… …to execution
Define data & AI strategies
Build data platforms and data infrastructure
Develop data & AI solutions
Analytics Translators
Analytics Engineers
Data Scientists
ML Engineers
Data Engineers
Data Architects
Open positions:
Account managers
Support Engineer
Analytics Engineers
Data Engineers
ML Engineers
Data Scientists
Data Solutions Architect
Intro to DataFusion: Technology, Community, and Not Quite Enough Time
Andrew Lamb
Staff Engineer, InfluxData
January 23, 2025, Data & Drinks: Building Next-Gen Data Systems with Apache DataFusion
| © Copyright 2024, InfluxData
DataFusion: Technology
Top Level Project, Apache Software Foundation
Apache 2.0 Licensed
Analogy: DataFusion is LLVM for Databases
LLVM enabled innovation in programming languages:
Clang: C/C++ frontend + LLVM
Rust: Rustlang frontend + LLVM
Swift: Swift frontend + LLVM
…
Analogy: DataFusion is LLVM for Databases
Analytic Application
Domain Specific Language
Specialized Database
Application Logic
Catalog
Analysis Engine
Multiple SQL Dialects
Data Flow Analysis
Custom Operators
File System Interface
…
DataFusion enables innovation in data intensive systems
Architecture
Front Ends: SQL, DataFrame
Plan Representations and Rewrites: LogicalPlan, ExecutionPlan, Expression Eval, Optimizations / Transformations
Execution Engine: HashAggregate, Sort, Join, …, Streams
Catalog and Data Sources: Parquet, CSV, …
Extension points: Frontend, Catalog / Table, LogicalPlan Rewrite, ExecutionPlan Rewrite, Extension Node, Stream
Recognized Tech
Top of the Line Performance
Speed (and underlying techniques) similar to other top engines such as ClickHouse and DuckDB
Top of the Line Performance (is fleeting!)
ClickBench Results as of Jan 23, 2025
Looking for help rerunning ClickBench on DataFusion 44 (and 45!)
Update ClickBench benchmarks with DataFusion 44.0.0 #13983
Trending
Who’s who of DataFusion users
Architecture
Design Goals:
Results for Users
Use Case: File Formats (Lance)
Lance file format uses DataFusion to implement pushdown filtering
Arrow -> Encoder -> Lance Format -> Decoder -> Arrow
Encoder uses DataFusion aggregators to calculate min/max
Decoder uses DataFusion Expr simplification to calculate zone pruning
Courtesy of Weston Pace
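The min/max and zone pruning idea on this slide can be sketched in a few lines of standard-library Rust. This is only an illustration, not Lance or DataFusion code: the `Zone` struct, `build_zones`, `prune_zones_eq`, and `scan_eq` are hypothetical names, and the predicate is assumed to be a simple `col = value` on an `i64` column. The "encoder" records min/max per zone; the "decoder" skips zones whose range cannot contain the value.

```rust
/// Hypothetical per-zone statistics; not the actual Lance file layout.
#[derive(Debug, Clone, PartialEq)]
struct Zone {
    start: usize, // index of the first row in the zone
    len: usize,   // number of rows in the zone
    min: i64,
    max: i64,
}

/// "Encoder" side: split the column into fixed-size zones and record min/max.
fn build_zones(values: &[i64], zone_size: usize) -> Vec<Zone> {
    assert!(zone_size > 0, "zone_size must be positive");
    values
        .chunks(zone_size)
        .enumerate()
        .map(|(i, chunk)| Zone {
            start: i * zone_size,
            len: chunk.len(),
            min: *chunk.iter().min().unwrap(), // chunks are never empty
            max: *chunk.iter().max().unwrap(),
        })
        .collect()
}

/// "Decoder" side: keep only zones whose [min, max] range can contain `target`.
fn prune_zones_eq(zones: &[Zone], target: i64) -> Vec<&Zone> {
    zones
        .iter()
        .filter(|z| z.min <= target && target <= z.max)
        .collect()
}

/// Scan only the surviving zones and return the matching row indices.
fn scan_eq(values: &[i64], zones: &[Zone], target: i64) -> Vec<usize> {
    prune_zones_eq(zones, target)
        .into_iter()
        .flat_map(|z| z.start..z.start + z.len)
        .filter(|&row| values[row] == target)
        .collect()
}

fn main() {
    let values = [1, 3, 2, 10, 12, 11, 20, 25, 22];
    let zones = build_zones(&values, 3);
    let kept = prune_zones_eq(&zones, 11);
    println!("zones kept: {} of {}", kept.len(), zones.len());
    println!("matching rows: {:?}", scan_eq(&values, &zones, 11));
}
```

In this toy data only the middle zone survives pruning for the value 11, so two thirds of the rows are never read. A real format would evaluate far richer predicates, which is where an expression simplifier becomes useful.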
Use Case: Table Formats
delta-rs uses DataFusion for various features (and provides a TableProvider for reading)
delta-rs / deltalake python package
write
Predicate Evaluation (in overwrite mode)
Optimize (compact)
Load
Delete
…
Z-Order evaluation
TableProvider: Projection pushdown, Limit pushdown, Predicate pushdown
Predicate Evaluation
Load_CDF: Custom plan nodes for change data feeds
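To make the three pushdowns named on this slide concrete, here is a toy scan in standard-library Rust. It is a sketch under stated assumptions, not the DataFusion `TableProvider` trait or delta-rs code: `MemTable`, `GreaterThan`, and this `scan` signature are hypothetical, rows hold only `i64` values, and the only supported predicate is `column > value`.

```rust
/// Toy in-memory table (hypothetical; not DataFusion's `TableProvider`).
struct MemTable {
    rows: Vec<Vec<i64>>,
}

/// A single pushed-down predicate of the form `row[col] > value`.
struct GreaterThan {
    col: usize,
    value: i64,
}

impl MemTable {
    /// Apply the predicates, then the limit, then the projection inside the
    /// scan, so the caller never sees rows or columns it did not ask for.
    fn scan(
        &self,
        projection: Option<&[usize]>,
        filters: &[GreaterThan],
        limit: Option<usize>,
    ) -> Vec<Vec<i64>> {
        self.rows
            .iter()
            // Predicate pushdown: drop non-matching rows at the source.
            .filter(|row| filters.iter().all(|f| row[f.col] > f.value))
            // Limit pushdown: stop after `limit` matching rows.
            .take(limit.unwrap_or(usize::MAX))
            // Projection pushdown: materialize only the requested columns.
            .map(|row| match projection {
                Some(cols) => cols.iter().map(|&c| row[c]).collect::<Vec<i64>>(),
                None => row.clone(),
            })
            .collect()
    }
}

fn main() {
    let table = MemTable {
        rows: vec![vec![1, 10], vec![2, 20], vec![3, 30], vec![4, 40]],
    };
    let filters = [GreaterThan { col: 0, value: 1 }];
    let out = table.scan(Some(&[1][..]), &filters, Some(2));
    println!("{:?}", out);
}
```

The limit is applied after the filter here, which is what makes it safe to push down together with the predicate; a source that cannot evaluate a predicate exactly would have to leave the limit to the engine.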
Use Case: SQL Analysis (frontend)
[Diagram: SDF Development Framework. Stages: Ingest, Deploy, Analyze, Guarantee. SQL dialects: BigQuery, Redshift, Snowflake, Trino/Presto. Components: Catalog, Metadata, Unified Logical Plan, Static Analysis. Outputs: Assembly, IntelliSense, Tests, Reports, Business Value.]
SDF uses complete ANTLR grammars to define many SQL dialects, notably proprietary ones like Snowflake.
All SQL dialects compile to a unified intermediate representation: the DataFusion LogicalPlan.
This gives SDF executable semantics.
SDF's transformation layer statically analyzes many logical plans at once for correctness and generates rich metadata.
Courtesy of SDF / Lukas Shute
Use Case: Execution Engine
Integration Layer with Spark
DataFusion’s ExecutionPlan Streams
Use Spark Planner / Executor machinery
Use Case: Specialized Database Systems
Examples
Domain Specific Language
Catalog
Custom Operators
See more: Apache DataFusion documentation
DataFusion: Community + Not Quite Enough Time
“To achieve great things, two things are needed; a plan, and not quite enough time.”
- Leonard Bernstein (according to the internet)
Who Controls Project / Roadmap
🤑💰
Community
Not started / donated by a company: founded by Andy Grove
Community:
Velocity:
* Caveat: some distortion due to tortured git history
“Community over Code” - The Apache Way
Non profit governance of open source communities
Apache: Benefits for DataFusion
My Personal / Professional Goal
1,000+ projects!
(Used to be a crazy number I just made up. Not so crazy anymore…)
Thank You: On with the Talks
THANK YOU
Backup
DataFusion / Query Engine: Input / Output
Inputs:
SQL Query:
SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
DataFrame:
ctx.read_table("http")?
    .filter(...)?
    .aggregate(..)?;
Catalog information: tables, schemas, etc
Data Batches
Output:
Data Batches
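As a rough illustration of the input/output contract on this slide (batches in, batches out), the following standard-library Rust computes what the SQL query above asks for. It does not use DataFusion: the `Batch` struct and `count_by_status` function are hypothetical, and the `status` column is assumed to be a numeric HTTP status code.

```rust
use std::collections::BTreeMap;

/// Hypothetical columnar batch holding only the two columns the query touches.
struct Batch {
    path: Vec<String>,
    status: Vec<u16>, // assumption: status is a numeric HTTP code
}

/// Hand-written equivalent of:
///   SELECT status, COUNT(1) FROM http_api_requests_total
///   WHERE path = '/api/v2/write' GROUP BY status;
fn count_by_status(batches: &[Batch], wanted_path: &str) -> BTreeMap<u16, u64> {
    let mut counts = BTreeMap::new();
    for batch in batches {
        for (path, status) in batch.path.iter().zip(&batch.status) {
            // WHERE path = ...
            if path.as_str() == wanted_path {
                // GROUP BY status, COUNT(1)
                *counts.entry(*status).or_insert(0) += 1;
            }
        }
    }
    counts
}

fn main() {
    let batches = vec![
        Batch {
            path: vec!["/api/v2/write".to_string(), "/api/v2/query".to_string()],
            status: vec![204, 200],
        },
        Batch {
            path: vec!["/api/v2/write".to_string(), "/api/v2/write".to_string()],
            status: vec![204, 500],
        },
    ];
    for (status, count) in count_by_status(&batches, "/api/v2/write") {
        println!("{status}: {count}");
    }
}
```

A query engine produces the same result from either the SQL text or the DataFrame calls, after planning and optimizing against the catalog, instead of requiring a hand-written loop per query.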