1 of 24

The Prometheus Time Series Database

Björn “Beorn” Rabenstein, Production Engineer, SoundCloud Ltd.

2 of 24

This is about sample storage…

…not indexing (as needed for PromQL)

Sample: 64bit timestamp + 64bit floating point value

3 of 24

The fundamental problem of TSDBs

Orthogonal write and read patterns.

[Diagram: a matrix of samples, series (~millions) by time (~weeks); writes touch every series at the current instant, while reads follow a few series across long time ranges.]
4 of 24

External storage needed

Key-Value store (with BigTable semantics) seems suitable.

...

http_requests_total{status="200",method="GET"}@1434317560938 ⇒ 94355

http_requests_total{status="200",method="GET"}@1434317561287 ⇒ 94934

http_requests_total{status="200",method="GET"}@1434317562344 ⇒ 96483

http_requests_total{status="404",method="GET"}@1434317560938 ⇒ 38473

http_requests_total{status="404",method="GET"}@1434317561249 ⇒ 38544

http_requests_total{status="404",method="GET"}@1434317562588 ⇒ 38663

http_requests_total{status="200",method="POST"}@1434317560885 ⇒ 4748

http_requests_total{status="200",method="POST"}@1434317561483 ⇒ 4795

http_requests_total{status="200",method="POST"}@1434317562589 ⇒ 4833

http_requests_total{status="404",method="POST"}@1434317560939 ⇒ 122

...

KEY: metric name + dimensions (aka labels) + timestamp

VALUE: sample value

5 of 24

Google Cloud Bigtable Schema Design

https://cloud.google.com/bigtable/docs/schema-design-time-series

6 of 24

Why is in-memory compression needed?

Gorilla vs. Prometheus

Gorilla (1.37 bytes/sample):

  • In-memory only
  • 1s resolution
  • Fixed-time blocks (2h)
  • Not concerned with decoding

Prometheus (3.3 bytes/sample):

  • Demultiplexing to local disk
  • 1ms resolution
  • Fixed-size chunks (1kiB)
  • Random accessibility and decoding

Gorilla: A Fast, Scalable, In-Memory Time Series Database, T. Pelkonen et al., Proceedings of the VLDB Endowment 8(12), August 2015, pp. 1816–1827.

7 of 24

Double-delta (ΔΔ) encoding underlies both chunk encodings, v1 and v2, selected via -storage.local.chunk-encoding-version.

8 of 24

Prometheus’s chunked storage

[Diagram: a series iterator on top, delegating to chunk iterators, one per chunk.]

9 of 24

Timestamp compression

“Pretty” regular sample intervals.

sample n   timestamp   t − t₀   Δt     ΔΔ vs. base Δ (v1)   ΔΔ vs. previous Δt (v2)
0          1000s         –       –            –                      –
1          1015s        15s     15s           –                      –
2          1029s        29s     14s          -1s                    -1s
3          1046s        46s     17s          +1s                    +3s
4          1060s        60s     14s           0s                    -3s

(base Δ = 15s, so ΔΔ(v1) = tₙ − (t₀ + n·15s); ΔΔ(v2) = Δtₙ − Δtₙ₋₁)

Store the double deltas:

  • v1: in fixed bit width (8, 16, 32) as required by the whole sample set.
  • Gorilla-style (v2 uses different buckets, see next slide): with variable bit width (1, 9, 12, 16, 36) as required per sample.
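The numbers in the table can be reproduced with a few lines of Go (the v1/v2 attribution follows this slide; this is not actual Prometheus code):

```go
package main

import "fmt"

func main() {
	ts := []int64{1000, 1015, 1029, 1046, 1060} // timestamps in seconds
	baseDelta := ts[1] - ts[0]                  // base Δ = 15s
	for n := 2; n < len(ts); n++ {
		delta := ts[n] - ts[n-1]       // Δt to the previous sample
		prevDelta := ts[n-1] - ts[n-2] // previous Δt
		ddBase := ts[n] - (ts[0] + int64(n)*baseDelta) // ΔΔ vs. base Δ (v1)
		ddPrev := delta - prevDelta                    // ΔΔ vs. previous Δt (v2)
		fmt.Printf("n=%d  Δt=%2ds  ΔΔ(v1)=%2ds  ΔΔ(v2)=%2ds\n", n, delta, ddBase, ddPrev)
	}
}
```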

10 of 24

Prometheus v2 timestamp encoding

Almost like Gorilla, with different bit buckets…

  • If ΔΔt in [-32,31]: 10 + 6bit
  • If ΔΔt in [-65536,65535]: 110 + 17bit
  • If ΔΔt in [-4194304,4194303]: 111 + 23bit
  • If a chunk doesn’t receive any sample for 1h, we close it anyway.

BUT:

  • If ΔΔt = 0: 0 + 7bit counting repetitions (–1)
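A sketch of the bucket selection in Go (prefixes and ranges as on this slide; the 7-bit repetition counter and the exact overflow handling are simplified assumptions):

```go
package main

import "fmt"

// bucket returns how a ΔΔt would be encoded, using the control-bit
// prefixes and ranges listed above.
func bucket(dod int64) string {
	switch {
	case dod == 0:
		return "'0' + 7 bit repetition counter"
	case dod >= -32 && dod <= 31:
		return "'10' + 6 bit"
	case dod >= -65536 && dod <= 65535:
		return "'110' + 17 bit"
	case dod >= -4194304 && dod <= 4194303:
		return "'111' + 23 bit"
	default:
		return "does not fit any bucket"
	}
}

func main() {
	for _, dod := range []int64{0, -1, 3, 500, 1 << 22} {
		fmt.Printf("ΔΔt=%d → %s\n", dod, bucket(dod))
	}
}
```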

11 of 24

Value compression

Way trickier...

64bit floating point numbers. Ugh...

12 of 24

13 of 24

Constant value time series

Prometheus v1/2

  • Store value once (64bit float).
  • Then store no values at all. The timestamp is enough.

→ 0bit/sample.

Gorilla

  • Store first value (64bit float).
  • Then store the XOR between current and previous value (yields 0 for constant values).
  • Encode a zero XOR as a single 0 bit.

→ 1bit/sample.
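Why one bit per sample: the XOR of two identical float64 bit patterns is zero, and Gorilla writes a zero XOR as a single 0 control bit. A minimal Go illustration:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	prev := math.Float64bits(42.0)
	cur := math.Float64bits(42.0)
	fmt.Printf("%064b\n", prev^cur) // all zeros → written as one '0' bit
}
```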

14 of 24

The best case for a Prometheus v2 chunk

Constant metric value, perfectly regular scraping.

124,547 samples

(3w with 15s scrape interval)

0.066 bits/sample
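Arithmetic check: that is one full chunk, 1024 bytes = 8192 bits, and 8192 bits / 124,547 samples ≈ 0.066 bits per sample.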

15 of 24

16 of 24

Regularly increasing values

Prometheus v1

  • Apply same double-delta encoding as for timestamps.
  • Use integers (8, 16, 32 bit) internally if possible, otherwise float32. If 64 bits are required, revert to storing values directly as float64.
  • For values increasing with precisely the same slope, 0bit needed.

Gorilla

  • As before: Store 1st value directly, then store XOR result of current value with previous value.
  • Now encode it in a clever way, referring to the previous XOR value (similar in spirit to double-delta, but the two steps are an XOR and a more involved bit encoding), as sketched below.

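A runnable Go sketch of this XOR scheme (simplified: the real Gorilla and Prometheus varbit encoders cap the leading-zero count and differ in other details; encodeXOR is a made-up helper that only counts bits):

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// encodeXOR returns the total encoded size in bits, control bits included,
// following the Gorilla scheme: '0' for a zero XOR, '10' + meaningful bits
// if they fit the previous leading/trailing-zero window, and '11' + 5 bit
// leading-zero count + 6 bit length + meaningful bits otherwise.
func encodeXOR(values []float64) (totalBits int) {
	prev := math.Float64bits(values[0])
	totalBits = 64 // the first value is stored verbatim
	prevLeading, prevTrailing := -1, -1
	for _, v := range values[1:] {
		cur := math.Float64bits(v)
		xor := prev ^ cur
		if xor == 0 {
			totalBits++ // single '0' control bit
		} else {
			leading := bits.LeadingZeros64(xor)
			trailing := bits.TrailingZeros64(xor)
			if prevLeading >= 0 && leading >= prevLeading && trailing >= prevTrailing {
				// meaningful bits fit into the previous window
				totalBits += 2 + (64 - prevLeading - prevTrailing)
			} else {
				totalBits += 2 + 5 + 6 + (64 - leading - trailing)
				prevLeading, prevTrailing = leading, trailing
			}
		}
		prev = cur
	}
	return totalBits
}

func main() {
	fmt.Println(encodeXOR([]float64{42, 42, 42, 42, 42}))         // 64 + 4×1 = 68 bits
	fmt.Println(encodeXOR([]float64{94355, 94934, 96483, 97012})) // a counter-like series
}
```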

17 of 24

18 of 24

More or less random values

Prometheus

  • Double-delta encoding is tried, but a fall-back to directly storing float64 values is likely.

Gorilla

  • Same encoding as before. Truly random data could result in an overhead (more than 64bit per sample).


19 of 24

Prometheus v2 value encoding

Picks the first that works from the following list:

  1. Zero encoding.
  2. Integer double-delta encoding with 0/6/13/20/33 bit buckets
  3. XOR float encoding (like Gorilla with minor tweaks)
  4. Direct encoding (if XOR results in 64bit or more per value)

If you dare, check out storage/local/varbit.go.
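A toy Go version of that cascade (the helper checks are simplified stand-ins, not what varbit.go actually does):

```go
package main

import (
	"fmt"
	"math"
)

// pickEncoding tries the encodings in the order listed above and returns
// the first one that works for the given values.
func pickEncoding(values []float64) string {
	// 1. Zero encoding: every value equals the first one.
	constant := true
	for _, v := range values[1:] {
		if v != values[0] {
			constant = false
			break
		}
	}
	if constant {
		return "zero"
	}
	// 2. Integer double-delta: all ΔΔ must be integral and fit the
	// largest bucket (33 bit signed).
	fits := true
	prevDelta := values[1] - values[0]
	for i := 2; i < len(values); i++ {
		dod := (values[i] - values[i-1]) - prevDelta
		if dod != math.Trunc(dod) || dod < -(1<<32) || dod >= 1<<32 {
			fits = false
			break
		}
		prevDelta = values[i] - values[i-1]
	}
	if fits {
		return "int double-delta"
	}
	// 3./4. XOR float encoding, falling back to direct float64 storage
	// if the XOR result needs 64 bit or more per value (not modeled here).
	return "xor (or direct)"
}

func main() {
	fmt.Println(pickEncoding([]float64{5, 5, 5, 5}))          // zero
	fmt.Println(pickEncoding([]float64{10, 20, 30, 41}))      // int double-delta
	fmt.Println(pickEncoding([]float64{0.1, 0.25, 0.7, 1.3})) // xor (or direct)
}
```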

1.28 bytes/sample (typical SoundCloud server)

20 of 24

Constant-size chunks.

1024 bytes.

[Diagram: sample ingestion appends to an incomplete head chunk in memory; complete chunks are immutable and evictable (LRU), and are persisted to disk, one file per time series; the PromQL query engine reads chunks in memory.]

-storage.local.max-chunks-to-persist

-storage.local.memory-chunks

21 of 24

Series maintenance.

[Diagram: chunks older than the retention time are dropped from the series files on disk; a series file is only rewritten when the droppable part is larger than 10% of the file.]

-storage.local.retention

-storage.local.series-file-shrink-ratio

-storage.local.series-sync-strategy

22 of 24

Chunk preloading.

[Diagram: the PromQL query engine preloads the chunks needed for a query from disk into memory.]

23 of 24

Checkpointing.

On shutdown and regularly to limit data loss in case of a crash.

[Diagram: not-yet-persisted chunk data is written from memory to a checkpoint file on disk.]

-storage.local.checkpoint-interval

-storage.local.checkpoint-dirty-series-limit
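Putting the storage flags from these slides together, a Prometheus 1.x server might be tuned like this (the values are made up for illustration, not recommendations):

```
prometheus \
  -storage.local.chunk-encoding-version=2 \
  -storage.local.retention=360h \
  -storage.local.memory-chunks=1048576 \
  -storage.local.max-chunks-to-persist=524288 \
  -storage.local.checkpoint-interval=5m
```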

24 of 24