1 of 30

Intro to DataFusion: Technology, Community, and Not Quite Enough Time

Andrew Lamb | Staff Engineer, InfluxData

January 15, 2025, Boston DataFusion Meetup

| © Copyright 2024, InfluxData

1

2 of 30

Outline

  • Acknowledgements
  • Agenda
  • Intro to DataFusion

| © Copyright 2024, InfluxData

2

3 of 30

Thank you to our Sponsors

Room and facilities

(thanks Justin & Gladys!)

| © Copyright 2024, InfluxData

3

4 of 30

Thank you to our Sponsors

Food

| © Copyright 2024, InfluxData

4

5 of 30

Talks

6:00 - 6:20: Intro to DataFusion: Technology, Community, and Not Quite Enough Time

Andrew Lamb is a Staff Engineer at InfluxData, working on InfluxDB 3.0, focused on query processing, the Apache DataFusion query engine and the Apache Arrow ecosystem. He is a member of the Apache Software Foundation and the Apache DataFusion PMC (2024 Chair), and Apache Arrow PMC (2023 Chair) (LinkedIn).

6:20 - 6:40: Embucket: a Snowflake-compatible lakehouse based on DataFusion

Camuel Gilyadov is Founder and CEO of Embucket (LinkedIn)

6:40 - 7:00: Stateless engines, table formats, and the future

Jake Thomas is Manager, Data Foundations at Okta (LinkedIn)

6:40 - 7:00: Building InfluxDB 3.0 with DataFusion

Andrew Lamb

Update: Jake is unable to make it tonight 🤧

| © Copyright 2024, InfluxData

5

6 of 30

DataFusion: Technology

| © Copyright 2024, InfluxData

6

7 of 30

Top Level Project, Apache Software Foundation

Apache 2.0 Licensed

| © Copyright 2024, InfluxData

7

8 of 30

Analogy: DataFusion is LLVM for Databases

Clang

Rust

Swift

C/C++ frontend

LLVM

Rustlang frontend

LLVM

Swift frontend

LLVM

LLVM enabled innovation in programming languages:

  • High quality reusable optimizer, code generator, debugger, lsp integration, etc.
  • Focus on language design, ecosystem, libraries, etc

| © Copyright 2024, InfluxData

8

9 of 30

Analogy: DataFusion is LLVM for Databases

Analytic Application

Domain Specific Language

Specialized Database

Application Logic

Catalog

Analysis Engine

Multiple SQL Dialects

Data Flow Analysis

Custom Operators

File System Interface

DataFusion enables innovation in data intensive systems

  • High quality reusable SQL planner, optimizer, function library, vectorized operators, etc
  • Focus on language design, data management, use case specific features

| © Copyright 2024, InfluxData

9

10 of 30

Recognized Tech

Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine

In SIGMOD 2024

We wrote this paper entirely in the open

| © Copyright 2024, InfluxData

10

11 of 30

Top of the Line Performance

Speed (and underlying techniques) similar to other top engines such as ClickHouse + DuckDB

| © Copyright 2024, InfluxData

11

12 of 30

Trending

Who’s who of DataFusion users

| © Copyright 2024, InfluxData

12

13 of 30

Architecture

Design Goals:

  • Work “out of the box” (fast time to awesome)
  • Customize everything via APIs
  • Architecturally Boring 🥱 (“Industrial best practice”)

Results for Users

  • Quickly start with a basic, high-performance engine
  • Specialize to suit their needs and available engineering capacity
  • Easy to try out new ideas (operators, rewrites, etc)

| © Copyright 2024, InfluxData

13

14 of 30

Use Case: File Formats (Lance)

Courtesy of Weston Pace

Decoder uses DataFusion Expr simplification to calculate zone pruning (source)

Encoder uses DataFusion aggregators to calculate min/max (source)

Lance Format

Arrow

Arrow

Lance file format uses DataFusion to implement pushdown filtering

Encoder

Decoder

| © Copyright 2024, InfluxData

14

15 of 30

Use Case: Table Formats

delta-rs uses DataFusion for various features (and provides a TableProvider for reading)

delta-rs / deltalake python package

write

Predicate Evaluation (in overwrite mode)

Optimize (compact)

Load

Delete

Z-Order evaluation

  • sort

TableProvider:

Projection pushdown,

Limit pushdown

Predicate pushdown

Predicate Evaluation

Load_CDF

Custom plan nodes

for change data

deltafeeds

| © Copyright 2024, InfluxData

15

16 of 30

Use Case: SQL Analysis (frontend)

SDF

Assembly

Intelisense

Tests

Reports

Business Value

Catalog

BigQuery

Redshift

Metadata

Unified Logical Plan

SDF

Static Analysis

SDF Development Framework

Ingest

Deploy

Analyze

Guarantee

Snowflake

Trino/Presto

SDF uses complete ANTLR Grammars to define many SQL dialects - notably, proprietary ones like Snowflake.

All SQLs compile to a unified Intermediate Representation: the Datafusion Logical Plan.

This gives SDF Executable Semantics.

SDF’s transformation layer statically analyzes many logical plans at once for correctness and generates rich metadata.

Courtesy of SDF / Lukas Shute

| © Copyright 2024, InfluxData

16

17 of 30

Use Case: Execution Engine

Integration Layer with Spark

DataFusion’s ExecutionPlan Streams

Use Spark Planner / Executor machinery

| © Copyright 2024, InfluxData

17

18 of 30

Use Case: Specialized Database Systems

Examples

Domain Specific Language

Catalog

Custom Operators

| © Copyright 2024, InfluxData

18

19 of 30

DataFusion: Community + Not Quite Enough Time

| © Copyright 2024, InfluxData

19

20 of 30

“To achieve great things, two things are needed; a plan, and not quite enough time.”

- Leonard Bernstein (according to the internet)

| © Copyright 2024, InfluxData

20

21 of 30

Who Controls Project / Roadmap

  • DataFusion development is NOT directly funded
    • Users contribute together (including engineers paid by companies who build using DataFusion)
  • ⇒ Always a bit short on time, especially:
    • Reviews, Documentation, Ticket Triage
  • Pro: forced prioritization

Plug: Join us to have fun and have a say 🎣

🤑💰

| © Copyright 2024, InfluxData

21

22 of 30

Community

Not started / donated by a company: founded by Andy Grove

Community:

Velocity:

  • Monthly releases for the last 3 years
  • Many commits a day

* Caveat: some distortion due to tortured git history

| © Copyright 2024, InfluxData

22

23 of 30

“Community over Code” - The Apache Way

Non profit governance of open source communities

| © Copyright 2024, InfluxData

23

24 of 30

Apache: Benefits for DataFusion

  • ⇒ Predictable Foundation
  • Stable License: (ASL 20 years old) low risk of changes, (ahem OpenTofu)
  • Communication: Predictable and open (if slow)
  • Multi-Vendor Participation: Shared investment reduces individual risk
  • Long Term Maintenance: Hedged against life changes, corporate strategy shifts, VC funding cycles
  • ⭐⭐⭐⭐⭐: Works far better than could be reasonably expected

| © Copyright 2024, InfluxData

24

25 of 30

My Personal / Professional Goal

1,000+ projects!

( Used to be a crazy number I just made up. Not so crazy anymore…)

| © Copyright 2024, InfluxData

25

26 of 30

Thank You: On with the Talks

| © Copyright 2024, InfluxData

26

27 of 30

T H A N K Y O U

| © Copyright 2024, InfluxData

27

28 of 30

Backup

| © Copyright 2024, InfluxData

28

29 of 30

DataFusion / Query Engine: Input / Output

29

Data Batches

SQL Query

SELECT status, COUNT(1)

FROM http_api_requests_total

WHERE path = '/api/v2/write'

GROUP BY status;

Data Batches

DataFrame

ctx.read_table("http")?

.filter(...)?

.aggregate(..)?;

Catalog information:

tables, schemas, etc

| © Copyright 2024, InfluxData

29

30 of 30

Architecture

SQL

Front Ends

DataFrame

LogicalPlan

ExecutionPlan

Plan Representations and Rewrites

Expression Eval

Optimizations / Transformations

Optimizations / Transformations

HashAggregate

Sort

Execution Engine

Join

Catalog and

Data Sources

Parquet

CSV

Extension

Catalog / Table

Extension

Frontend

Extension

LogicalPlan Rewrite

Extension ExecutionPlan Rewrite

Extension

Stream

Extension Node

Extension Node

Streams

| © Copyright 2024, InfluxData

30