InfluxDB�on Apache Flink
Open Source Data Processing – Data Engineering Systems Group
Ramin Gharib, Felix Seidel, and Leon Papke
Winter Semester 2020/2021
03.03.2021
Agenda
2
Use Case
Connector Design
Benchmarks
OSDP
Use Case
3
Flight Tracking
4
OSDP
Data Point (Example)
5
// Real World Example
adsb_icao,flight=DLH9MU,ground=false tas=442,alt_baro=39000,lat=52.331,lon=12.456 1608125488700000047
OSDP
Data Point (Schema)
6
// Example
adsb,ground=false,scientist=mullen seen_pos=0,alt_geom=6625 1608123758799999952
---- ----------------------------- ------------------------ -------------------
| | | |
Measurement Tag set Field set Timestamp
(Required) (Optional) (Required) (Optional)
Data Point
OSDP
Flight Tracking
7
OSDP
Flight Tracking
8
OSDP
Flight Tracking
9
OSDP
Flight Tracking
10
OSDP
Flight Tracking
11
OSDP
Flight Tracking
12
OSDP
Connector�Design
13
High Level Architecture
14
POST /api/v2/write
Content-Type: text/html
Body:
test longValue=1i 1�test longValue=2i 2
test longValue=3i 3
...
Split Reader 0
Source Reader
Task Manager
...
Data Source
OSDP
A Push-Based Design
15
OSDP
User-Defined Deserializer
16
Data point parser
test longValue=1i 1�test longValue=2i 2
test longValue=3i 3
...
Request Body (String)
Deserialize
test longValue=1i 1�test longValue=2i 2
test longValue=3i 3
...
Data Point (Object)
1L, 2L, ...
...
Data Source
OSDP
High Level Architecture
17
Sink Writer
1L, 2L, ...
test longValue=1i
test longValue=2i
...
Sink Committer
Last Timestamp
checkpoint checkpoint=flink <timestamp>
OSDP
Sink Writer with User-Defined Serializer
18
Serialize incoming data
1L, 2L, ...
Add data point to buffer
test,longValue=1 fieldKey=sink
test,longValue=2 fieldKey=sink
...
Data Point (Object)
Is buffer full?
Yes
Wait for other input
No
OSDP
Exactly-once Delivery Guarantee
19
OSDP
Benchmarks
20
Experiment Design
21
OSDP
Source Benchmarks
Throughput�+ Latency:
Query
Query
Data Generator
Data Generator
HTTP Post
HTTP Post
HTTP Post
Latency:
22
testGenerator, simpleTag=testTag fieldCount=i++ eventTime
...
OSDP
Source Queries
23
OSDP
Event Throughput
24
OSDP
Request Throughput
25
OSDP
Event Latency
26
OSDP
Sink Queries
27
OSDP
Event Throughput
28
OSDP
Event Latency
29
OSDP
Future Work
30
Possible Extensions
31
OSDP
Q&A
InfluxDB on Apache Flink
Ramin Gharib, Felix Seidel, and Leon Papke
32
Appendix
33
OSDP
InfluxDB Editions
“InfluxDB OSS does not support clustering. For high availability or horizontal scaling of InfluxDB, consider the InfluxData commercial clustered offering, InfluxDB Enterprise/Cloud.”
34
OSDP
Storage Engine
35
OSDP
InfluxDB OSS Architecture (v2)
36
OSDP
Updates in InfluxDB
“Time series data is predominantly new data that is never updated. Deletes generally only affect data that isn’t being written to, and contentious updates never occur.”
InfluxDB Design Principles
37
OSDP
Duplicate Data in InfluxDB
“To simplify conflict resolution and increase write performance, InfluxDB assumes data sent multiple times is duplicate data. Identical points aren’t stored twice. If a new field value is submitted for a point, InfluxDB updates the point with the most recent field value. In rare circumstances, data may be overwritten.”
InfluxDB, Design Principles.
38
OSDP
Initial Source Design Ideas
39
| Write API (Telegraf) | Pull (Flux queries) | Push (Tasks) | |
Idea | Flink implements InfluxDB 2.0 write API endpoint | Flink implements custom HTTP endpoint | Flink implements poll-based queries from InfluxDB | Flink implements endpoint to which InfluxDB tasks push data |
Pro | Flink compatible with all InfluxDB integrations | No dependency on write API endpoint interface | Replayable Source Easy to use (write Flux queries as usual) | Lower Latency |
Con | Non-replayable API format dependency | Non-replayable custom integrations required for upstream clients | Latency Achieving exactly-once may be hard | Non-Replayable Achieving exactly-once may be hard |
| Write API (Telegraf) | Pull (Flux queries) | Push (Tasks) | |
Idea | Flink implements InfluxDB 2.0 write API endpoint | Flink implements custom HTTP endpoint | Flink implements poll-based queries from InfluxDB | Flink implements endpoint to which InfluxDB tasks push data |
Pro | Flink compatible with all InfluxDB integrations | No dependency on write API endpoint interface | Replayable Source Easy to use (write Flux queries as usual) | Lower Latency |
Con | Non-replayable API format dependency | Non-replayable custom integrations required for upstream clients | Latency Achieving exactly-once may be hard | Non-Replayable Achieving exactly-once may be hard |
OSDP
Initial Sink Design Ideas
40
| Checkpoint Series | “Uncommitted” Bucket |
Idea | Insert an additional measurement whenever Flink creates a checkpoint. | Write new data into an „uncommitted table“ and copy the table contents into the „committed table“ when creating a checkpoint. |
Pro | The user can query the data up to the most recent checkpoint. | No complex query logic to differentiate between checkpointed data and not checkpointed data. |
Con | “Max & Filter Query” required to only get committed data | A lot of data copying. |
| Checkpoint Series | “Uncommitted” Bucket |
Idea | Insert an additional measurement whenever Flink creates a checkpoint. | Write new data into an „uncommitted table“ and copy the table contents into the „committed table“ when creating a checkpoint. |
Pro | The user can query the data up to the most recent checkpoint. | No complex query logic to differentiate between checkpointed data and not checkpointed data. |
Con | “Max & Filter Query” required to only get committed data | A lot of data copying. |
OSDP
User-Defined Deserializer
41
OSDP
User-Defined Serializer
42
OSDP