1 of 21

Apache Arrow and DataFusion: Changing the Game for Implementing Database Systems

Andrew Lamb, InfluxData

June 23, 2022

The Data Thread

2 of 21

Today: IOx Team at InfluxData; Apache Arrow PMC Member

Past life 1: Query Optimizer @ Vertica, also on Oracle DB server

Past life 2: Chief Architect + VP Engineering roles at some ML startups

3 of 21

Proliferation of Databases

3

DB

4 of 21

4

5 of 21

What is going on?

COTS → Totally Custom

“Buy and Operate”

Buy software from vendors
Operate on your own hardware, with sysadmins

“Build and Operate”

Write software for, and operate all components
Optimized for exact needs

“Assemble and Operate”

Assemble from open source technologies
Operate on resources in a public cloud

5

IT

FANG

✓

Current Trend

6 of 21

Part of a long term trend in DB Specialization

Relational

Key-Value

Timeseries

Graph

Array / Scientific

Document

Stream

Embedded / Edge

Cloud

Single-Node

Hybrid

Hadoop

Java

Json / Javascript

AWS

GCP

Azure

Apple Cloud

Transactions

Analytics

Streaming

Batch / ETL

...

Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st International Conference on Data Engineering (ICDE '05). IEEE Computer Society, USA, 2–11. DOI:https://doi.org/10.1109/ICDE.2005.1

Data Model

Deployment

Ecosystem

Use Case

7 of 21

What is DataFusion?

8 of 21

Implementation timeline for a new Database system

Client

API

In memory

storage

In-Memory

filter + aggregation

Durability / persistence

Metadata Catalog +

Management

Query

Language

Parser

Optimized /

Compressed storage

Execution on

Compressed

Data

Joins!

Additional Client

Languages

Outer Joins

Subquery support

More advanced analytics

Cost based optimizer

Out of core algorithms

Storage Rearrangement

Heuristic Query Planner

Arithmetic expressions

Date / time Expressions

Concurrency

Control

Data Model /

Type System

Distributed query execution

Resource

Management

“Lets Build

a Database”

🤔

“Ok now this is pretty good”

😐

“Look mom!

I have a database!”

😃

Online recovery

Window functions

9 of 21

DataFusion: A Query Engine

“DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.”

DataFusion Website

10 of 21

DataFusion: A Query Engine

SQL Query

SELECT status, COUNT(1)

FROM http_api_requests_total

WHERE path = '/api/v2/write'

GROUP BY status;

RecordBatches

DataFrame

ctx.read_table("http")?

.filter(...)?

.aggregate(..)?;

RecordBatches

Catalog information:

tables, schemas, etc

OR

11 of 21

But for Databases

🤔

12 of 21

DataFusion: LLVM-like Infrastructure for Databases

SQL

Query FrontEnds

DataFrame

LogicalPlans

ExecutionPlan

Plan Representations

(DataFlow Graphs)

Expression Eval

Optimizations / Transformations

HashAggregate

Sort

…

Optimized Execution Operators

(Arrow Based)

Join

Data Sources

Parquet

CSV

…

DataFusion

13 of 21

DataFusion: Totally Customizable

SQL

Query FrontEnds

DataFrame

LogicalPlans

ExecutionPlan

Plan Representations

(DataFlow Graphs)

Expression Eval

Optimizations / Transformations

HashAggregate

Sort

…

Join

Data Sources

Parquet

CSV

DataFusion

Extend ✅

Optimized Execution Operators

(Arrow Based)

14 of 21

Example Uses

15 of 21

Cube.js / Cube Store

https://cube.dev/

Overview:

Headless Business Intelligence
Cube.js pre-aggregation storage layer.

Use of DataFusion (fork)

SQL API (with custom extensions)
Custom Logical and Physical Operators
UDFs: custom functions
Optimized native plan execution

15

16 of 21

InfluxDB IOx

https://github.com/influxdata/influxdb_iox

Overview:

In-memory columnar store using object storage, future core of InfluxDB; support SQL, InfluxQL, and Flux
Query and data reorganization built with DataFusion

Use of DataFusion:

Table Provider: Custom data sources
SQL API
PlanBuilder API: Plans for custom query language
UD Logical and Execution Plans
UDFs: to implement the precise semantics of influxRPC
Optimized native plan execution

16

17 of 21

FLOCK

https://github.com/flock-lab/flock

Overview:

Low-Cost Streaming Query Engine on FaaS Platforms
Project from UMD Database Group, runs streaming queries on AWS Lambda (x86 and arm64/graviton2).

Use of DataFusion

SQL API:
DataFrame API: To build plans
Optimized native plan execution

17

18 of 21

VegaFusion

https://vegafusion.io/

Overview:

Accelerates execution of (interactive) data visualizations
Compiles Vega data transforms into DataFusion query plans.

Use of DataFusion:

DataFrame API: To build plans
UDFs: to implement some Vega expressions
Optimized native plan execution

18

19 of 21

We ❤️ Our Contributors

Active and Welcoming Community
Contributions at all levels are encouraged and welcomed.
We have Database Internals experts, novices looking for experience writing Rust, and everything in between.

20 of 21

Learn More + Join Us

Project site:

Architecture Slides

DataFusion: An Embeddable Query Engine Written in Rust (google slides) (slideshare)

21 of 21

Thank You

Andrew Lamb: andrew@nerdnetworks.org