1 of 55

Building InfluxDB 3.0

With Apache Arrow, DataFusion, Flight, Parquet

Andrew Lamb | Staff Engineer, InfluxData

1

2 of 55

Andrew Lamb

Staff Engineer InfluxData

> 20 21 😱years in enterprise software development

Oracle: Database (2 years)

DataPower: XSLT compiler (2 years)

Vertica: DB / Query Optimizer (6 years)

Nutonian/DataRobot: ML Startups (7 years)

InfluxData: InfluxDB 3.0, Arrow, DataFusion (4 years)

i n f l u x d a t a . c o m

2

3 of 55

Goals

Convince you, via example, that:

Databases of the future will be assembled from reusable components
This is not a terrible idea

Talk outline:

Database Implementation Trends
Time Series Databases, need for (yet another) one
FDAP: Flight, DataFusion, Arrow, Parquet: Rationale + Use InfluxDB 3.0

Blog: Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0

3

4 of 55

Thesis: Long Term Trends in Databases

Time

Adoption

Ease of Building

3.0

4

5 of 55

“One Size Fits All”

An Idea Whose Time Has Come and Gone

Stonebraker & Çetintemel (2005)

5

6 of 55

Case Study: In which InfluxData decides to build a new time series database.

6

7 of 55

What / Why of Time Series Databases

Specialized for storing data with times (obviously).

Key Properties:

High volume, denormalized ingest

eg. host=myhost123.example.com repeated over and over

Low latency query after load: milliseconds between ingest and query
Schema on write: new columns can appear at any time, backfills
Rapid data value decay: newest data is super important, falls off drastically

Examples:

Open → InfluxDB, Timescale, Graphite Whisper, VictoriaMetrics

Closed → Facebook Gorilla, Google Monarch, AWS Timestream, DataDog Husky

7

8 of 55

InfluxDB 3.0 Requirements

Need: No series cardinality limits:

Tech: TSM (LSM Tree / KV Store) → Column Store

Need: ‘Infinite’ Retention

Tech: TSM on locally attached disks → object store (cheap!)

Need: Elastic Scalability

Tech: Shared Nothing (local disk)→ Disaggregated Storage (S3)

Need: Ecosystem Compatibility

Tech: InfluxQL / custom APIs → SQL + JDBC/ODBC

8

9 of 55

So, let’s build a new Database

(How hard could that be, really?)

9

10 of 55

It is hard (expensive) to build a new databases

Database company money raised

Snowflake: $2B
Databricks (Spark) $3.5B
MongoDB: $311M
SingleStore: $464.1M
CockroachLabs (CockroachDB): $633.1M
Pingcap (TiDB): $341.6M
Elastic: $162M
TimescaleDB: $181M
DuckDB $? / MotherDuck: $47.5M $100M

Source: https://www.crunchbase.com

I did this at Vertica too

10

11 of 55

“We can do it with the Apache Arrow Ecosystem”

https://www.influxdata.com/blog/apache-arrow-parquet-flight-and-their-ecosystem-are-a-game-changer-for-olap/

Evan (CEO)

Paul (CTO)

11

12 of 55

Benefits of building on FDAP foundation

⇒ Innovation on Time Series, FDAP provides standard (lengthy to build) pieces

match tool_needed_for_database {

File format (persistence) => Parquet,

Columnar memory representation => Arrow Arrays,

Operations (e.g. multiply, avg) => Compute Kernels,

SQL + extensible query engine => Arrow DataFusion,

Network transfer => Arrow Flight,

JDBC/ODBC driver => Arrow FlightSQL,

}

12

13 of 55

Let’s do it

13

14 of 55

InfluxDB 3.0 Architecture

14

15 of 55

InfluxDB 3.0: Fast, Real Time & Cost Effective

15

16 of 55

“Community over Code” - The Apache Way

Non profit governance of open source communities

16

17 of 55

Apache: Benefits for InfluxDB 3.0

⇒ Predictable Foundation
Stable License: (ASL 20 years old) low risk of changes, (ahem OpenTofu)
Communication: Predictable and open (if slow)
Multi-Vendor Participation: Shared investment reduces individual risk
Long Term Maintenance: Hedged against life changes, corporate strategy shifts, VC funding cycles
⭐⭐⭐⭐⭐: Works far better than could be reasonably expected

17

18 of 55

Columnar file format from 2013 (originally part of Hadoop ecosystem)

https://parquet.apache.org/

18

19 of 55

Apache Parquet: Benefits for InfluxDB

⇒ Avoid Engineering overhead of building a custom file format + ecosystem

Compression: Works well with wide data variety (including time series)

Performance: Techniques such as projection, filter pushdown, late materialization

Interoperability: “Defacto” interchange format for analytics, immediate compatibility with 1000s of tools (Apache Iceberg makes it even easier)

19

20 of 55

TSM (time series) vs Parquet Compression

Entire dataset (Parquet/TSM: 20%)

Format	File Count	Size (GB)
TSM	483	591
TSM (gzip)	483	97
Parquet	246,140 (4K - 2.2GB)	118

Format	File Count	Size (GB)
TSM	1	.110
Parquet	54 (4K - 7M)	.020

File 1 (Parquet/TSM: 18%)

Format	File Count	Size (GB)
TSM	1	2
Parquet	974 (12K - 53M)	.226

File 2 (Parquet/TSM 11%)

20

21 of 55

Parquet Organization

Source: https://www.influxdata.com/blog/querying-parquet-millisecond-latency/

A	B	C
15	Foo	1/1/2023
…
…
11	Bar	..1/5/2023
50	Baz	1/1/2023
…
…
32	Blarg	1/6/2023

(“PAX” in DB literature)

21

22 of 55

Parquet Structure + Metadata

Source: https://www.influxdata.com/blog/querying-parquet-millisecond-latency/

…

Metadata footer

Highly encoded / compressed pages

Footer contains location of pages, and statistics such as min/max/count/nullcount.

(“Zone Maps”, “Small Materialized Aggregates” in DB literature)

22

23 of 55

Parquet Projection + Filter Pushdown

Source: https://www.influxdata.com/blog/querying-parquet-millisecond-latency/

…

Metadata footer

Highly encoded / compressed pages

Metadata + query to prune (skip) pages that aren’t needed

SELECT A

...

WHERE C > 25

1. Consult metadata

2. Only read/decode necessary pages

23

24 of 55

How does InfluxDB 3.0 use Parquet?

Durable Store: All data durably persisted to object store as parquet;

Query: Read from parquet files + latest unpersisted ingester data

Object Store

Ingester

2. Periodically writes data buffer as sorted parquet files to object store

Ingester

…

write path

read path

Querier

4. Reads (and caches) parquet data from object store to answers queries

SELECT … FROM ...

3. User query

1. Incoming line protocol

weather,location=us-east temperature=82,humidity=67 1465839830100400200�weather,location=us-midwest temperature=82,humidity=65 1465839830100400200�weather,location=us-west temperature=70,humidity=54 1465839830100400200�weather,location=us-east temperature=83,humidity=69 1465839830200400200�weather,location=us-midwest temperature=87,humidity=78 1465839830200400200�weather,location=us-west temperature=72,humidity=56 1465839830200400200�weather,location=us-east temperature=84,humidity=67 1465839830300400200�weather,location=us-midwest temperature=90,humidity=82 1465839830400400200�weather,location=us-west temperature=71,humidity=57 1465839830400400200

24

25 of 55

In memory format for fast vectorized processing

https://arrow.apache.org/

25

26 of 55

Arrow: Benefits for InfluxDB

⇒ Best practice for storing columnar data in memory

Type System: Full type support (integers, strings, timestamps, etc)

Null Support: Standard null bitmask representation + semantics

Efficient Encodings: Dictionary encoding for Strings

Natural Integration: DataFusion, parquet libraries, Arrow Flight

Optimized Compute Kernels: Fast vectorized kernels are well understood, but time consuming to implement, test and maintain

Note: Arrow has (many) more features than this

26

27 of 55

Arrow Array: Int64Array

Pretty much what you will find in every vectorized column store engine

15
1743
NULL
432
..
9
322
8

Logically 1024

8 byte integers

Buffer (aligned)

8192 bytes

15

0

Byte Offset

1743

??

432

9

322

8

…

8

16

24

8168

8176

8184

…

Arrow Array

1

0

1

Validity (bitmask)

128 bytes

Bit Offset

…

0

1

2

3

127

126

125

27

28 of 55

Compute Kernels

let output = gt(

&left,

&right

);

+

10

20

17

5

23

5

9

12

4

5

76

2

3

5

2

33

2

1

6

7

8

2

7

2

5

6

7

8

left

right

output

1

0

1

0

1

0

1

0

>

The gt (greater than) kernel computes an output BooleanArray where each element is left > right

Kernels handle nulls (validity masks), optimizations for different data sizes, etc.

~50 different kernels, full list: docs.rs page

28

29 of 55

How does InfluxDB 3.0 use Arrow: Ingester Buffers

Indirectly: via DataFusion and Flight.

Directly: Ingest path, which parses input line protocol → Arrow Arrays

Ingester

4. If queried prior to writing parquet buffer is snapshotted, turned into RecordBatch sent to querier via ArrowFlight

Querier

2. LP is parsed and appended to an in memory buffer (mutable batch)

open buffer

3. Periodically writes sorted data on object store using datafusion plan

Snapshots

(Record Batches)

1. Incoming Line Protocol

weather,location=us-east temperature=82,humidity=67 1465839830100400200�weather,location=us-midwest temperature=82,humidity=65 1465839830100400200�weather,location=us-west temperature=70,humidity=54 1465839830100400200�weather,location=us-east temperature=83,humidity=69 1465839830200400200�weather,location=us-midwest temperature=87,humidity=78 1465839830200400200�weather,location=us-west temperature=72,humidity=56 1465839830200400200�weather,location=us-east temperature=84,humidity=67 1465839830300400200�weather,location=us-midwest temperature=90,humidity=82 1465839830400400200�weather,location=us-west temperature=71,humidity=57 1465839830400400200

Object Store

29

30 of 55

https://github.com/apache/arrow-datafusion

Highly customizable, fast query engine that uses Arrow

30

31 of 55

Arrow DataFusion: Benefits for InfluxDB

⇒ Fast full featured, extensible query engine

All the OLAP buzzwords: vectorized, columnar, multi-core, streaming, out of core, …

“Full” SQL: JOINs, date/time/timestamp functions, structured data, …

Customizable: Easily extend time series specific functions (e.g. date_bin gapfill, InfluxQL, …)

31

32 of 55

DataFusion: Input / Output

Data Batches

SQL Query

SELECT status, COUNT(1)

FROM http_api_requests_total

WHERE path = '/api/v2/write'

GROUP BY status;

Data Batches

DataFrame

ctx.read_table("http")?

.filter(...)?

.aggregate(..)?;

Catalog information:

tables, schemas, etc

32

33 of 55

DataFusion: Totally Customizable Architecture

SQL

Query FrontEnds

DataFrame

LogicalPlans

ExecutionPlan

Plan Representations

(DataFlow Graphs)

Expression Eval

Optimizations / Transformations

HashAggregate

Sort

…

Join

Parquet

CSV

DataFusion

Extend ✅

Optimized Execution Operators

(Arrow Based)

Catalog / Data Source

33

34 of 55

How InfluxDB 3.0 uses DataFusion

Optimization

(storage pruning, pushdown, etc)

Physical Planning

Execution

DataFusion

LogicalPlan

DataFusion

ExecutionPlan

DataFusion

Streams

gRPC output

Arrow Flight

Client Specific

Output formats

Arrow

Record Batches

ParquetWriter

SeriesFrame

...

FlightData

Write to Parquet files

Storage gRPC Frontend

SQL Frontend

(from DataFusion)

Client / Language Specific Frontends

read_group(..)

SELECT …

FROM …

Reorg Frontend

compact_plan(..)

InfluxQL Frontend

SELECT … FROM … GROUP BY *

All queries (3 different query languages)
Parquet creation + compaction

34

35 of 55

https://arrow.apache.org/docs/format/Flight.html

Flight

Efficiently send columnar data (as Arrow Arrays) over the network

35

36 of 55

Arrow Flight: Benefits for InfluxDB

⇒ Efficient network transfer; wide client support

Network Efficient: columnar format, high bandwidth transfer

CPU Efficient: “zero copy” Serialization / Deserialization

Pre Existing Clients: Minimal effort to make InfluxDB 3.0 clients (just use Arrow Flight clients)

36

37 of 55

How does InfluxDB 3.0 use Flight? Native Query Protocol

Client ←→ Querier

Querier

write path

read path

SELECT … FROM ...

1. User query via Flight

Ingester

2. Querier combines data from ingesters and parquet into response

3. Response via Flight

37

38 of 55

How does InfluxDB 3.0 use Flight?

Ingester ←→ Querier

Ingester and Querier’s internal protocol is built on Flight*

Ingester

Querier

SELECT … FROM ...

write path

read path

1. User query

2. The querier fetches unpersisted (not yet written to parquet) data returned from the ingester, as RecordBatches via Flight

* we are working on in an optimized flight-like protocol to reduce latency

38

39 of 55

https://arrow.apache.org/docs/format/FlightSql.html

FlightSQL

Execute SQL queries and return results with a standard (non custom) client

39

40 of 55

Arrow FlightSQL: Benefits for InfluxDB

⇒ Access to SQL ecosystem, without implementing our own drivers/connectors

Prepackaged Client Libraries: JDBC/ODBC/Ecosystem drivers

Pre-packaged Integrations: Many systems already read from FlightSQL

Community Leverage: E.g. FlightSQL Grafana plugin, others contribute

40

41 of 55

Apache Arrow FlightSQL

Send SQL queries, receive Responses as Arrow Arrays
Has clients in many languages / APIs (JDBC, python DB API, etc)

https://arrow.apache.org/blog/2022/02/16/introducing-arrow-flight-sql/

41

42 of 55

How InfluxDB 3.0 uses FlightSQL

✅ Access to the ecosystem without having to implement JDBC, ODBC, … 😅

client

arrow-flight JDBC

Driver

InfluxDB 3.0

Data is sent via Arrow FlightSQL

(columnar)

arrow-flight JDBC driver implements JDBC API

JDBC Interface

42

43 of 55

Before Flight SQL

Server

Database specific protocol (e.g. Postgres FEBE)

Caveat: Only show JDBC and ODBC drivers. Also common is a native implementation in each language ecosystem - e.g like the Python DB API 20 and psycopg2 etc. See Apache Arrow Database Connectivity (ADBC) for more details

client

JDBC

Driver

JDBC Interface

client

JDBC

Driver

JDBC Interface

client

ODBC

Driver

ODBC Interface

43

44 of 55

Alternate Strategy: Use Posgres FEBE

(Slow + painful 🤮)

client

Server

Client: Prepare query:"SELECT * FROM foo WHERE a = $1;" name:""

Client: Bind name:"" parameters:[{format:"text", value:"42"}]

… (steps elided) …

Server: RowDescription fields:[{name:"a", type:"int", format:"text"}, …]

Server: DataRow fields:[{data:"42"}, {data: "Hunter Valley"} …]

Server: DataRow fields:[{data:"12"}, {data: "Merrimack Valley"} …]

… (lots ros for each data)

Server: DataRow fields:[{data:"321"}, {data: "Charles River Watershed"} …]

… (steps elided) …

JDBC Interface

3. Data is sent row by row

Postgres

JDBC

Driver

DataRow {..}

…

1. Data is produced column by column in RecordBatches

PG febe adapter

2. Adapter converts to the postgres FEBE protocol

It also turns out clients using the postgres driver tend to try and query the postgres metadata tables 😱

44

45 of 55

Conclusion

45

46 of 55

Conclusion

Building Databases from scratch is hard (and expensive 💸)
You don’t have to anymore 🎉
We built InfluxDB 3.0 using Apache Flight, DataFusion, Arrow, and Parquet, and it was awesome
⭐⭐⭐⭐⭐: Highly recommended for your next projects

46

47 of 55

Related Work

The Composable Data Management System Manifesto (VLDB 2023)

Velox: Meta’s Unified Execution Engine

(VLDB 2022)

The Modern Data Architecture: The Deconstructed Database (USENIX ;login: Winter 2018)

A Deep Dive into Common Open Formats for Analytical DBMSs (VLDB 2023)

Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine

(To Appear SIGMOD 2024)

47

48 of 55

Try It Yourself

https://www.influxdata.com

https://github.com/InfluxCommunity

48

49 of 55

T H A N K Y O U

49

50 of 55

Backup Slides

50

51 of 55

Thank you!

Questions / Discussion

Find out more:

Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0
Apache Arrow: https://arrow.apache.org/
Apache DataFusion: https://arrow.apache.org/datafusion/
Apache Parquet: https://parquet.apache.org/

51

52 of 55

Defragmenting Data Access Across Systems

52

53 of 55

Integration: Arrow Language Implementations

53

Java

C++

JavaScript

Go

Rust

C

Ruby

Python

Julia

R

C#

Matlab

53

54 of 55

Why Arrow Internally (and not just at interface)?

Option 1: Use Arrow Internally (DataFusion, pola.rs, Acero)

Pros: Fast interchange, reuse Arrow libraries

Cons: Constrained(*) to Arrow

Option 2: Use specialized structures internally, convert to Arrow at edges (Velox, DuckDB)

Pros: Can use specialized structures

Cons: Maintain specialized code

Scan

Filter

Agg

Scan

Filter

?

Agg

?

Convert to arrow

54

55 of 55

Why Arrow Internally (and not just at interface)?

So far results are encouraging

Theory: Using Arrow is “good enough” compared to specialized structures

Pooled open source development → invest heavily in optimized Arrow kernels

https://voltrondata.com/resources/speeds-and-feeds-hardware-and-software-matter

https://github.com/tustvold/access-log-bench

Good: Sorting, Filtering, Projection, Parquet

Could Improve: Grouping, Joining

55