Apache Arrow and DataFusion: Changing the Game for Implementing Database Systems
Today: IOx Team at InfluxData; Apache Arrow PMC Member
Past life 1: Query Optimizer @ Vertica, also on Oracle DB server
Past life 2: Chief Architect + VP Engineering roles at some ML startups
Proliferation of Databases
3
DB
4
What is going on?
COTS β Totally Custom
βBuy and Operateβ
βBuild and Operateβ
βAssemble and Operateβ
5
IT
FANG
β
Current Trend
Part of a long term trend in DB Specialization
Relational
Key-Value
Timeseries
Graph
Array / Scientific
Document
Stream
Embedded / Edge
Cloud
Single-Node
Hybrid
Hadoop
Java
Json / Javascript
AWS
GCP
Azure
Apple Cloud
Transactions
Analytics
Streaming
Batch / ETL
...
Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st International Conference on Data Engineering (ICDE '05). IEEE Computer Society, USA, 2β11. DOI:https://doi.org/10.1109/ICDE.2005.1
Data Model
Deployment
Ecosystem
Use Case
What is DataFusion?
Implementation timeline for a new Database system
Client
API
In memory
storage
In-Memory
filter + aggregation
Durability / persistence
Metadata Catalog +
Management
Query
Language
Parser
Optimized /
Compressed storage
Execution on
Compressed
Data
Joins!
Additional Client
Languages
Outer Joins
Subquery support
More advanced analytics
Cost based optimizer
Out of core algorithms
Storage Rearrangement
Heuristic Query Planner
Arithmetic expressions
Date / time Expressions
Concurrency
Control
Data Model /
Type System
Distributed query execution
Resource
Management
βLets Build
a Databaseβ
π€
βOk now this is pretty goodβ
π
βLook mom!
I have a database!β
π
Online recovery
Window functions
DataFusion: A Query Engine
βDataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.β
DataFusion: A Query Engine
SQL Query
SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
RecordBatches
DataFrame
ctx.read_table("http")?
.filter(...)?
.aggregate(..)?;
RecordBatches
Catalog information:
tables, schemas, etc
OR
But for Databases
π€
DataFusion: LLVM-like Infrastructure for Databases
SQL
Query FrontEnds
DataFrame
LogicalPlans
ExecutionPlan
Plan Representations
(DataFlow Graphs)
Expression Eval
Optimizations / Transformations
Optimizations / Transformations
HashAggregate
Sort
β¦
Optimized Execution Operators
(Arrow Based)
Join
Data Sources
Parquet
CSV
β¦
DataFusion
DataFusion: Totally Customizable
SQL
Query FrontEnds
DataFrame
LogicalPlans
ExecutionPlan
Plan Representations
(DataFlow Graphs)
Expression Eval
Optimizations / Transformations
Optimizations / Transformations
HashAggregate
Sort
β¦
Join
Data Sources
Parquet
CSV
DataFusion
Extend β
Extend β
Extend β
Extend β
Extend β
Extend β
Extend β
Extend β
Optimized Execution Operators
(Arrow Based)
Example Uses
Cube.js / Cube Store
15
InfluxDB IOx
16
FLOCK
17
VegaFusion
18
We β€οΈ Our Contributors
Learn More + Join Us
Project site:
Architecture Slides
Thank You
Andrew Lamb: andrew@nerdnetworks.org