1 of 24

The Prometheus Time Series Database

Björn “Beorn” Rabenstein, Production Engineer, SoundCloud Ltd.

2 of 24

This is about sample storage…

…not indexing (as needed for PromQL)

Sample: 64bit timestamp + 64bit floating point value

3 of 24

The fundamental problem of TSDBs

Orthogonal write and read patterns.

[Diagram: a matrix of samples, series (~millions) by time (~weeks); writes touch every series at the current instant, while reads follow a few series across long time ranges.]
4 of 24

External storage needed

Key-Value store (with BigTable semantics) seems suitable.

...

http_requests_total{status="200",method="GET"}@1434317560938 ⇒ 94355

http_requests_total{status="200",method="GET"}@1434317561287 ⇒ 94934

http_requests_total{status="200",method="GET"}@1434317562344 ⇒ 96483

http_requests_total{status="404",method="GET"}@1434317560938 ⇒ 38473

http_requests_total{status="404",method="GET"}@1434317561249 ⇒ 38544

http_requests_total{status="404",method="GET"}@1434317562588 ⇒ 38663

http_requests_total{status="200",method="POST"}@1434317560885 ⇒ 4748

http_requests_total{status="200",method="POST"}@1434317561483 ⇒ 4795

http_requests_total{status="200",method="POST"}@1434317562589 ⇒ 4833

http_requests_total{status="404",method="POST"}@1434317560939 ⇒ 122

...

KEY: metric name + dimensions (aka labels) + timestamp

VALUE: sample value

5 of 24

Google Cloud Bigtable Schema Design

https://cloud.google.com/bigtable/docs/schema-design-time-series

6 of 24

Why is in-memory compression needed?

Gorilla vs. Prometheus

Gorilla (1.37 bytes/sample):

  • In-memory only
  • 1s resolution
  • Fixed-time blocks (2h)
  • Not concerned with decoding

Prometheus (3.3 bytes/sample):

  • Demultiplexing to local disk
  • 1ms resolution
  • Fixed-size chunks (1kiB)
  • Random accessibility and decoding

Gorilla: A Fast, Scalable, In-Memory Time Series Database, T. Pelkonen et al., Proceedings of the VLDB Endowment 8(12), August 2015, pp. 1816–1827.

7 of 24

Double-delta (ΔΔ) encoding underlies both chunk encodings, v1 and v2, selected via -storage.local.chunk-encoding-version.

8 of 24

Prometheus’s chunked storage

[Diagram: a series iterator on top, delegating to chunk iterators, one per chunk.]

9 of 24

Timestamp compression

“Pretty” regular sample intervals.

sample n   timestamp   t − t₀   Δt     ΔΔ vs. base Δ (v1)   ΔΔ vs. previous Δt (v2)
0          1000s         –       –            –                      –
1          1015s        15s     15s           –                      –
2          1029s        29s     14s          -1s                    -1s
3          1046s        46s     17s          +1s                    +3s
4          1060s        60s     14s           0s                    -3s

(base Δ = 15s, so ΔΔ(v1) = tₙ − (t₀ + n·15s); ΔΔ(v2) = Δtₙ − Δtₙ₋₁)

Store the double deltas:

  • v1: in fixed bit width (8, 16, 32) as required by the whole sample set.
  • Gorilla-style (v2 uses different buckets, see next slide): with variable bit width (1, 9, 12, 16, 36) as required per sample.
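The numbers in the table can be reproduced with a few lines of Go (the v1/v2 attribution follows this slide; this is not actual Prometheus code):

```go
package main

import "fmt"

func main() {
	ts := []int64{1000, 1015, 1029, 1046, 1060} // timestamps in seconds
	baseDelta := ts[1] - ts[0]                  // base Δ = 15s
	for n := 2; n < len(ts); n++ {
		delta := ts[n] - ts[n-1]       // Δt to the previous sample
		prevDelta := ts[n-1] - ts[n-2] // previous Δt
		ddBase := ts[n] - (ts[0] + int64(n)*baseDelta) // ΔΔ vs. base Δ (v1)
		ddPrev := delta - prevDelta                    // ΔΔ vs. previous Δt (v2)
		fmt.Printf("n=%d  Δt=%2ds  ΔΔ(v1)=%2ds  ΔΔ(v2)=%2ds\n", n, delta, ddBase, ddPrev)
	}
}
```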

10 of 24

Prometheus v2 timestamp encoding

Almost like Gorilla, with different bit buckets…

  • If ΔΔt in [-32,31]: 10 + 6bit
  • If ΔΔt in [-65536,65535]: 110 + 17bit
  • If ΔΔt in [-4194304,4194303]: 111 + 23bit
  • If a chunk doesn’t receive any sample for 1h, we close it anyway.

BUT:

  • If ΔΔt = 0: 0 + 7bit counting repetitions (–1)
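A sketch of the bucket selection in Go (prefixes and ranges as on this slide; the 7-bit repetition counter and the exact overflow handling are simplified assumptions):

```go
package main

import "fmt"

// bucket returns how a ΔΔt would be encoded, using the control-bit
// prefixes and ranges listed above.
func bucket(dod int64) string {
	switch {
	case dod == 0:
		return "'0' + 7 bit repetition counter"
	case dod >= -32 && dod <= 31:
		return "'10' + 6 bit"
	case dod >= -65536 && dod <= 65535:
		return "'110' + 17 bit"
	case dod >= -4194304 && dod <= 4194303:
		return "'111' + 23 bit"
	default:
		return "does not fit any bucket"
	}
}

func main() {
	for _, dod := range []int64{0, -1, 3, 500, 1 << 22} {
		fmt.Printf("ΔΔt=%d → %s\n", dod, bucket(dod))
	}
}
```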

11 of 24

Value compression

Way trickier...

64bit floating point numbers. Ugh...

12 of 24

13 of 24

Constant value time series

Prometheus v1/2

  • Store value once (64bit float).
  • Then store no values at all. The timestamp is enough.

→ 0bit/sample.

Gorilla

  • Store first value (64bit float).
  • Then store the XOR between current and previous value (yields 0 for constant values).
  • Encode a zero XOR as a single 0 bit.

→ 1bit/sample.
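Why one bit per sample: the XOR of two identical float64 bit patterns is zero, and Gorilla writes a zero XOR as a single 0 control bit. A minimal Go illustration:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	prev := math.Float64bits(42.0)
	cur := math.Float64bits(42.0)
	fmt.Printf("%064b\n", prev^cur) // all zeros → written as one '0' bit
}
```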

14 of 24

The best case for a Prometheus v2 chunk

Constant metric value, perfectly regular scraping.

124,547 samples

(3w with 15s scrape interval)

0.066 bits/sample
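Arithmetic check: that is one full chunk, 1024 bytes = 8192 bits, and 8192 bits / 124,547 samples ≈ 0.066 bits per sample.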

15 of 24

16 of 24

Regularly increasing values

Prometheus v1

  • Apply same double-delta encoding as for timestamps.
  • Use integers (8, 16, 32 bit) internally if possible, otherwise float32. If 64 bits are required, revert to storing values directly as float64.
  • For values increasing with precisely the same slope, 0bit needed.

Gorilla

  • As before: Store 1st value directly, then store XOR result of current value with previous value.
  • Now encode it in a clever way, referring to the previous XOR value (similar in spirit to double-delta, but the two steps are an XOR and a more involved bit encoding), as sketched below.

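A runnable Go sketch of this XOR scheme (simplified: the real Gorilla and Prometheus varbit encoders cap the leading-zero count and differ in other details; encodeXOR is a made-up helper that only counts bits):

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// encodeXOR returns the total encoded size in bits, control bits included,
// following the Gorilla scheme: '0' for a zero XOR, '10' + meaningful bits
// if they fit the previous leading/trailing-zero window, and '11' + 5 bit
// leading-zero count + 6 bit length + meaningful bits otherwise.
func encodeXOR(values []float64) (totalBits int) {
	prev := math.Float64bits(values[0])
	totalBits = 64 // the first value is stored verbatim
	prevLeading, prevTrailing := -1, -1
	for _, v := range values[1:] {
		cur := math.Float64bits(v)
		xor := prev ^ cur
		if xor == 0 {
			totalBits++ // single '0' control bit
		} else {
			leading := bits.LeadingZeros64(xor)
			trailing := bits.TrailingZeros64(xor)
			if prevLeading >= 0 && leading >= prevLeading && trailing >= prevTrailing {
				// meaningful bits fit into the previous window
				totalBits += 2 + (64 - prevLeading - prevTrailing)
			} else {
				totalBits += 2 + 5 + 6 + (64 - leading - trailing)
				prevLeading, prevTrailing = leading, trailing
			}
		}
		prev = cur
	}
	return totalBits
}

func main() {
	fmt.Println(encodeXOR([]float64{42, 42, 42, 42, 42}))         // 64 + 4×1 = 68 bits
	fmt.Println(encodeXOR([]float64{94355, 94934, 96483, 97012})) // a counter-like series
}
```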

17 of 24

18 of 24

More or less random values

Prometheus

  • Double-delta encoding is tried, but a fall-back to directly storing float64 values is likely.

Gorilla

  • Same encoding as before. Truly random data could result in an overhead (more than 64bit per sample).


19 of 24

Prometheus v2 value encoding

Picks the first that works from the following list:

  1. Zero encoding.
  2. Integer double-delta encoding with 0/6/13/20/33 bit buckets
  3. XOR float encoding (like Gorilla with minor tweaks)
  4. Direct encoding (if XOR results in 64bit or more per value)

If you dare, check out storage/local/varbit.go.
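A toy Go version of that cascade (the helper checks are simplified stand-ins, not what varbit.go actually does):

```go
package main

import (
	"fmt"
	"math"
)

// pickEncoding tries the encodings in the order listed above and returns
// the first one that works for the given values.
func pickEncoding(values []float64) string {
	// 1. Zero encoding: every value equals the first one.
	constant := true
	for _, v := range values[1:] {
		if v != values[0] {
			constant = false
			break
		}
	}
	if constant {
		return "zero"
	}
	// 2. Integer double-delta: all ΔΔ must be integral and fit the
	// largest bucket (33 bit signed).
	fits := true
	prevDelta := values[1] - values[0]
	for i := 2; i < len(values); i++ {
		dod := (values[i] - values[i-1]) - prevDelta
		if dod != math.Trunc(dod) || dod < -(1<<32) || dod >= 1<<32 {
			fits = false
			break
		}
		prevDelta = values[i] - values[i-1]
	}
	if fits {
		return "int double-delta"
	}
	// 3./4. XOR float encoding, falling back to direct float64 storage
	// if the XOR result needs 64 bit or more per value (not modeled here).
	return "xor (or direct)"
}

func main() {
	fmt.Println(pickEncoding([]float64{5, 5, 5, 5}))          // zero
	fmt.Println(pickEncoding([]float64{10, 20, 30, 41}))      // int double-delta
	fmt.Println(pickEncoding([]float64{0.1, 0.25, 0.7, 1.3})) // xor (or direct)
}
```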

1.28 bytes/sample (typical SoundCloud server)

20 of 24

Constant-size chunks.

1024 bytes.

[Diagram: sample ingestion appends to an incomplete head chunk in memory; complete chunks are immutable and evictable (LRU), and are persisted to disk, one file per time series; the PromQL query engine reads chunks in memory.]

-storage.local.max-chunks-to-persist

-storage.local.memory-chunks

21 of 24

Series maintenance.

[Diagram: chunks older than the retention time are dropped from the series files on disk; a series file is only rewritten when the droppable part is larger than 10% of the file.]

-storage.local.retention

-storage.local.series-file-shrink-ratio

-storage.local.series-sync-strategy

22 of 24

Chunk preloading.

[Diagram: the PromQL query engine preloads the chunks needed for a query from disk into memory.]

23 of 24

Checkpointing.

On shutdown and regularly to limit data loss in case of a crash.

[Diagram: not-yet-persisted chunk data is written from memory to a checkpoint file on disk.]

-storage.local.checkpoint-interval

-storage.local.checkpoint-dirty-series-limit
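Putting the storage flags from these slides together, a Prometheus 1.x server might be tuned like this (the values are made up for illustration, not recommendations):

```
prometheus \
  -storage.local.chunk-encoding-version=2 \
  -storage.local.retention=360h \
  -storage.local.memory-chunks=1048576 \
  -storage.local.max-chunks-to-persist=524288 \
  -storage.local.checkpoint-interval=5m
```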

24 of 24