1 of 21

Apache Arrow and DataFusion: Changing the Game for Implementing Database Systems

Andrew Lamb, InfluxData

June 23, 2022

The Data Thread

2 of 21

Today: IOx Team at InfluxData; Apache Arrow PMC Member

Past life 1: Query Optimizer @ Vertica, also on Oracle DB server

Past life 2: Chief Architect + VP Engineering roles at some ML startups

3 of 21

Proliferation of Databases

3

DB

4 of 21

4

5 of 21

What is going on?

COTS β†’ Totally Custom

β€œBuy and Operate”

  • Buy software from vendors
  • Operate on your own hardware, with sysadmins

β€œBuild and Operate”

  • Write software for, and operate all components
  • Optimized for exact needs

β€œAssemble and Operate”

  • Assemble from open source technologies
  • Operate on resources in a public cloud

5

IT

FANG

βœ“

Current Trend

6 of 21

Part of a long term trend in DB Specialization

Relational

Key-Value

Timeseries

Graph

Array / Scientific

Document

Stream

Embedded / Edge

Cloud

Single-Node

Hybrid

Hadoop

Java

Json / Javascript

AWS

GCP

Azure

Apple Cloud

Transactions

Analytics

Streaming

Batch / ETL

...

Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st International Conference on Data Engineering (ICDE '05). IEEE Computer Society, USA, 2–11. DOI:https://doi.org/10.1109/ICDE.2005.1

Data Model

Deployment

Ecosystem

Use Case

7 of 21

What is DataFusion?

8 of 21

Implementation timeline for a new Database system

Client

API

In memory

storage

In-Memory

filter + aggregation

Durability / persistence

Metadata Catalog +

Management

Query

Language

Parser

Optimized /

Compressed storage

Execution on

Compressed

Data

Joins!

Additional Client

Languages

Outer Joins

Subquery support

More advanced analytics

Cost based optimizer

Out of core algorithms

Storage Rearrangement

Heuristic Query Planner

Arithmetic expressions

Date / time Expressions

Concurrency

Control

Data Model /

Type System

Distributed query execution

Resource

Management

β€œLets Build

a Database”

πŸ€”

β€œOk now this is pretty good”

😐

β€œLook mom!

I have a database!”

πŸ˜ƒ

Online recovery

Window functions

9 of 21

DataFusion: A Query Engine

β€œDataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.”

  • DataFusion Website

10 of 21

DataFusion: A Query Engine

SQL Query

SELECT status, COUNT(1)

FROM http_api_requests_total

WHERE path = '/api/v2/write'

GROUP BY status;

RecordBatches

DataFrame

ctx.read_table("http")?

.filter(...)?

.aggregate(..)?;

RecordBatches

Catalog information:

tables, schemas, etc

OR

11 of 21

But for Databases

πŸ€”

12 of 21

DataFusion: LLVM-like Infrastructure for Databases

SQL

Query FrontEnds

DataFrame

LogicalPlans

ExecutionPlan

Plan Representations

(DataFlow Graphs)

Expression Eval

Optimizations / Transformations

Optimizations / Transformations

HashAggregate

Sort

…

Optimized Execution Operators

(Arrow Based)

Join

Data Sources

Parquet

CSV

…

DataFusion

13 of 21

DataFusion: Totally Customizable

SQL

Query FrontEnds

DataFrame

LogicalPlans

ExecutionPlan

Plan Representations

(DataFlow Graphs)

Expression Eval

Optimizations / Transformations

Optimizations / Transformations

HashAggregate

Sort

…

Join

Data Sources

Parquet

CSV

DataFusion

Extend βœ…

Extend βœ…

Extend βœ…

Extend βœ…

Extend βœ…

Extend βœ…

Extend βœ…

Extend βœ…

Optimized Execution Operators

(Arrow Based)

14 of 21

Example Uses

15 of 21

Cube.js / Cube Store

  • Overview:
    • Headless Business Intelligence
    • Cube.js pre-aggregation storage layer.
  • Use of DataFusion (fork)
    • SQL API (with custom extensions)
    • Custom Logical and Physical Operators
    • UDFs: custom functions
    • Optimized native plan execution

15

16 of 21

InfluxDB IOx

  • Overview:
    • In-memory columnar store using object storage, future core of InfluxDB; support SQL, InfluxQL, and Flux
    • Query and data reorganization built with DataFusion
  • Use of DataFusion:
    • Table Provider: Custom data sources
    • SQL API
    • PlanBuilder API: Plans for custom query language
    • UD Logical and Execution Plans
    • UDFs: to implement the precise semantics of influxRPC
    • Optimized native plan execution

16

17 of 21

FLOCK

  • Overview:
    • Low-Cost Streaming Query Engine on FaaS Platforms
    • Project from UMD Database Group, runs streaming queries on AWS Lambda (x86 and arm64/graviton2).
  • Use of DataFusion
    • SQL API:
    • DataFrame API: To build plans
    • Optimized native plan execution

17

18 of 21

VegaFusion

  • Overview:
    • Accelerates execution of (interactive) data visualizations
    • Compiles Vega data transforms into DataFusion query plans.
  • Use of DataFusion:
    • DataFrame API: To build plans
    • UDFs: to implement some Vega expressions
    • Optimized native plan execution

18

19 of 21

We ❀️ Our Contributors

  • Active and Welcoming Community
  • Contributions at all levels are encouraged and welcomed.
  • We have Database Internals experts, novices looking for experience writing Rust, and everything in between.

20 of 21

Learn More + Join Us

Project site:

Architecture Slides

  • DataFusion: An Embeddable Query Engine Written in Rust (google slides) (slideshare)

21 of 21

Thank You

Andrew Lamb: andrew@nerdnetworks.org