1 of 27

Intro to DataFusion: Technology, Community, and Not Quite Enough Time

Andrew Lamb

Staff Engineer, InfluxData

Apache DataFusion PMC Chair

2 of 27

DataFusion: Technology

3 of 27

Top Level Project, Apache Software Foundation

Apache 2.0 Licensed

4 of 27

Analogy: DataFusion is LLVM for Databases

Clang

Rust

Swift

C/C++ frontend

LLVM

Rustlang frontend

LLVM

Swift frontend

LLVM

LLVM enabled innovation in programming languages:

  • High quality reusable optimizer, code generator, debugger, LSP integration, etc.
  • Focus on language design, ecosystem, libraries, etc.

5 of 27

Analogy: DataFusion is LLVM for Databases

Analytic Application

Domain Specific Language

Specialized Database

Application Logic

Catalog

Analysis Engine

Multiple SQL Dialects

Data Flow Analysis

Custom Operators

File System Interface

DataFusion enables innovation in data intensive systems

  • High quality reusable SQL planner, optimizer, function library, vectorized operators, etc.
  • Focus on language design, data management, use case specific features

6 of 27

Architecture

SQL

Front Ends

DataFrame

LogicalPlan

ExecutionPlan

Plan Representations and Rewrites

Expression Eval

Optimizations / Transformations

Optimizations / Transformations

HashAggregate

Sort

Execution Engine

Join

Catalog and Data Sources

Parquet

CSV

Extension

Catalog / Table

Extension

Frontend

Extension

LogicalPlan Rewrite

Extension ExecutionPlan Rewrite

Extension

Stream

Extension Node

Extension Node

Streams
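
The "LogicalPlan Rewrite" extension point in the diagram above can be pictured with a toy model. The sketch below is a dependency-free illustration only: the `Plan` enum and `push_down_filters` function are hypothetical names invented here and are not DataFusion's `LogicalPlan` or optimizer API. It shows the general shape of a rewrite rule, a function from plan to plan, using filter pushdown into a scan as the example.

```rust
// Toy model only: hypothetical types, NOT DataFusion's LogicalPlan API.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Read a table, applying any filters that were pushed into the scan.
    Scan { table: String, filters: Vec<String> },
    /// Keep only rows matching `predicate`.
    Filter { predicate: String, input: Box<Plan> },
    /// Keep only the named columns.
    Projection { columns: Vec<String>, input: Box<Plan> },
}

/// Rewrite rule: a Filter sitting directly on top of a Scan is folded into
/// the scan's filter list. Every other node is rebuilt around its rewritten
/// input, so the rule applies at any depth of the plan tree.
fn push_down_filters(plan: Plan) -> Plan {
    match plan {
        Plan::Filter { predicate, input } => match push_down_filters(*input) {
            Plan::Scan { table, mut filters } => {
                filters.push(predicate);
                Plan::Scan { table, filters }
            }
            // The input is not a scan: keep the Filter node where it is.
            other => Plan::Filter { predicate, input: Box::new(other) },
        },
        Plan::Projection { columns, input } => Plan::Projection {
            columns,
            input: Box::new(push_down_filters(*input)),
        },
        scan @ Plan::Scan { .. } => scan,
    }
}
```

Applied to Projection → Filter → Scan, the rule returns Projection → Scan with the predicate recorded in the scan's filter list. A real engine chains many such rules, which is why having plans as plain data structures makes new rewrites easy to try.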

7 of 27

Architecture

Design Goals:

  • Work “out of the box” (fast time to awesome)
  • Customize everything via APIs
  • Architecturally Boring 🥱 (“Industrial best practice”)

Results for Users

  • Quickly start with a basic, high-performance engine
  • Specialize to suit their needs and available engineering capacity
  • Easily try out new ideas (operators, rewrites, etc.)

8 of 27

Top of the Line Performance

Speed (and underlying techniques) similar to other top engines such as ClickHouse and DuckDB

9 of 27

Top of the Line Performance (is fleeting!)

ClickBench Results as of June 2, 2025

We basically know what is needed, but need help 🎣

10 of 27

My Personal / Professional Goal

1,000+ projects!

(Used to be a crazy number I just made up. Not so crazy anymore…)

11 of 27

Apache DataFusion Powered Products

12 of 27

Recognized Tech

Apache Arrow DataFusion:

A Fast, Embeddable, Modular Analytic Query Engine

13 of 27

Trending

Who’s who of DataFusion users

14 of 27

Use Case: Specialized Database Systems

Examples

Domain Specific Language

Catalog

Custom Operators

15 of 27

Use Case: Execution Engine

Integration Layer with Spark

DataFusion’s ExecutionPlan Streams

Use Spark Planner / Executor machinery

16 of 27

Use Case: File Formats (Lance)

Encoder

Decoder

Courtesy of Weston Pace

Decoder uses DataFusion Expr simplification to calculate zone pruning (source)

Encoder uses DataFusion aggregators to calculate min/max (source)

Lance Format

Arrow

Lance file format uses DataFusion to implement pushdown filtering

Arrow
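
The min/max and zone pruning idea on this slide can be sketched without any dependencies. This is a toy model under stated assumptions: `ZoneStats`, `compute_zone_stats`, and `zones_to_read` are hypothetical names, the column is a plain `i64` slice, and the only predicate handled is `value > threshold`. It is not Lance's or DataFusion's actual code, which works on Arrow arrays and general expressions.

```rust
// Toy model only: hypothetical types, not Lance's or DataFusion's API.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ZoneStats {
    min: i64,
    max: i64,
}

/// "Encoder" side: split a column into fixed-size zones and record the
/// min and max of each zone. `zone_size` must be non-zero.
fn compute_zone_stats(values: &[i64], zone_size: usize) -> Vec<ZoneStats> {
    values
        .chunks(zone_size)
        .map(|zone| ZoneStats {
            // `chunks` never yields an empty slice, so min/max always exist.
            min: *zone.iter().min().unwrap(),
            max: *zone.iter().max().unwrap(),
        })
        .collect()
}

/// "Decoder" side: for the predicate `value > threshold`, a zone can only
/// contain matching rows if its max exceeds the threshold. Returns the
/// indexes of the zones that still have to be read.
fn zones_to_read(stats: &[ZoneStats], threshold: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, zone)| zone.max > threshold)
        .map(|(index, _)| index)
        .collect()
}
```

For the column `[1, 5, 3, 10, 12, 11, 7, 8, 20]` with a zone size of 3 and the predicate `value > 9`, zone 0 (max 5) is skipped and zones 1 and 2 are read. Pruning is conservative: zone 2 is kept even though only one of its rows matches.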

17 of 27

Use Case: Table Formats

delta-rs uses DataFusion for various features (and provides a TableProvider for reading)

delta-rs / deltalake python package

Write

Predicate Evaluation (in overwrite mode)

Optimize (compact)

Load

Delete

Z-Order evaluation

  • sort

TableProvider: projection pushdown, limit pushdown, predicate pushdown

Predicate Evaluation

Load_CDF

Custom plan nodes for Delta change data feeds
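
The "Z-Order evaluation, then sort" step listed under Optimize can be illustrated with the textbook form of a Z-order (Morton) key. This is a generic sketch, not delta-rs's implementation: `z_order_key` and `sort_by_z_order` are hypothetical names, and the two-column `u32` input is an assumption made to keep the example small.

```rust
/// Build a Z-order (Morton) key by interleaving the bits of two columns:
/// bit i of `x` lands on bit 2i of the key, bit i of `y` on bit 2i + 1.
fn z_order_key(x: u32, y: u32) -> u64 {
    let mut key = 0u64;
    for bit in 0..32u32 {
        key |= (((x >> bit) & 1) as u64) << (2 * bit);
        key |= (((y >> bit) & 1) as u64) << (2 * bit + 1);
    }
    key
}

/// Sort rows by their Z-order key so that rows close together in both
/// columns tend to end up close together in the output.
fn sort_by_z_order(rows: &mut [(u32, u32)]) {
    rows.sort_by_key(|&(x, y)| z_order_key(x, y));
}
```

Because the key mixes bits from both columns, files written in this order tend to have narrow min/max ranges on each column, which is what makes later predicate pushdown effective.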

18 of 27

Use Case: SQL Analysis (frontend)

SDF

Assembly

IntelliSense

Tests

Reports

Business Value

Catalog

BigQuery

Redshift

Metadata

Unified Logical Plan

SDF

Static Analysis

SDF Development Framework

Ingest

Deploy

Analyze

Guarantee

Snowflake

Trino/Presto

SDF uses complete ANTLR Grammars to define many SQL dialects - notably, proprietary ones like Snowflake.

All SQL dialects compile to a unified Intermediate Representation: the DataFusion Logical Plan.

This gives SDF Executable Semantics.

SDF’s transformation layer statically analyzes many logical plans at once for correctness and generates rich metadata.

Courtesy of SDF / Lukas Shute

19 of 27

DataFusion: Community + Not Quite Enough Time

20 of 27

“Community over Code” - The Apache Way

Non-profit governance of open source communities

21 of 27

Apache: Benefits for DataFusion

  • ⇒ Predictable Foundation
  • Stable License: the ASL is 20 years old, so the risk of license changes is low (ahem, OpenTofu)
  • Communication: Predictable and open (if slow)
  • Multi-Vendor Participation: Shared investment reduces individual risk
  • Long Term Maintenance: Hedged against life changes, corporate strategy shifts, VC funding cycles
  • ⭐⭐⭐⭐⭐: Works far better than could be reasonably expected

22 of 27

Community

Not started / donated by a company:

  • Donated to the ASF by Andy Grove in 2019

Community:

Velocity:

  • Monthly releases for the last 3 years
  • Multiple commits a day (😅 still!)

* Caveat: some distortion due to tortured git history

23 of 27

Project / Roadmap

  • Development NOT directly funded
    • Users contribute (including engineers paid by companies who build using DataFusion)
    • No one paid full time to work on DataFusion (nor a research lab full of Database Experts)
  • Roadmap is determined by those who contribute

💰

24 of 27

Join the community, hack on Databases

It is a pretty great way to get a chance to work on Database Internals

⇒ Always short on time, especially:

  • Reviews, Documentation, Triage

25 of 27

Thank You:

On with the Talks

26 of 27

Backup

27 of 27

DataFusion / Query Engine: Input / Output

Data Batches

SQL Query

SELECT status, COUNT(1)

FROM http_api_requests_total

WHERE path = '/api/v2/write'

GROUP BY status;

DataFrame

ctx.read_table("http")?

.filter(...)?

.aggregate(..)?;

Catalog information:

tables, schemas, etc

Data Batches
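
To make the input/output picture concrete, the sketch below computes what the SQL query above asks for, over batches of columnar data. It is a dependency-free toy: `Batch` is a hypothetical stand-in for an Arrow record batch, `count_by_status` is a name invented here, and the row-at-a-time loop ignores the vectorized execution a real engine would use.

```rust
use std::collections::BTreeMap;

// Toy columnar batch: a hypothetical stand-in for an Arrow RecordBatch.
// Row i of the batch is (path[i], status[i]).
struct Batch {
    path: Vec<String>,
    status: Vec<u16>,
}

/// Equivalent of:
///   SELECT status, COUNT(1) FROM ... WHERE path = <wanted_path> GROUP BY status
/// evaluated over a sequence of input batches.
fn count_by_status(batches: &[Batch], wanted_path: &str) -> BTreeMap<u16, u64> {
    let mut counts = BTreeMap::new();
    for batch in batches {
        for (path, status) in batch.path.iter().zip(&batch.status) {
            // WHERE path = '...'
            if path.as_str() == wanted_path {
                // GROUP BY status, COUNT(1)
                *counts.entry(*status).or_insert(0) += 1;
            }
        }
    }
    counts
}
```

The function consumes batches and yields one aggregated result, mirroring "Data Batches" on both sides of the diagram; rows whose `path` does not match contribute nothing to any group.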