1 of 40

Sharing is Caring

Leveraging Open Source to Improve Cortex & Thanos

Bartłomiej Płotka

@bwplotka

Marco Pracucci

@pracucci

@cortexmetrics

@ThanosMetrics

@bwplotka

@pracucci

2 of 40

A personal story…

3 of 40

A personal story…

Open source is not about technology

4 of 40

A personal story…

Open source is about people

5 of 40

A personal story…

Open source is collaboration and giving back

6 of 40

Hello!

Marco Pracucci

Software Engineer @ Grafana Labs

Cortex and Thanos maintainer

7 of 40

Hello!

Bartłomiej Płotka

Principal Software Engineer @ Red Hat

Prometheus maintainer

Co-founder of Thanos

CNCF SIG Observability Tech Lead

8 of 40

Two households, both alike in dignity

9 of 40

Two households, both alike in dignity

In the beginning...

Cortex vs Thanos:

Push vs Pull

NoSQL + Object Storage vs Object Storage

Custom Indexing vs TSDB

10 of 40

Cortex & Thanos: Similarities

  • Similar goals!
  • Started by Prometheus maintainers
  • Use gRPC for communication
  • Written in Go
  • Same external APIs (Rules, Alerting, Querying, Meta)
  • Reuse Prometheus code and patterns
  • Horizontally scalable
  • Joined the CNCF

11 of 40

But, competition kicks in...

12 of 40

But, competition kicks in...

The soft competition with Thanos has been a trigger to improve the Cortex UX as well

  • 6-week release process
  • User-friendly documentation
  • Reduced dependencies (e.g. gossip-based hash ring)
  • Single binary mode
  • Can we remove the NoSQL index store altogether?
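A hash ring is how Cortex distributors decide which ingesters receive each series; the gossip option replaces an external KV store (such as Consul) for sharing the ring state, but lookups work the same way. The Go snippet below is a minimal consistent-hashing sketch, assuming FNV-1a hashing and a fixed number of tokens per ingester. `Ring`, `AddIngester` and `Get` are hypothetical names, not the Cortex ring API, and the gossip propagation itself is not shown.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
)

// token is a position on the ring owned by one ingester.
type token struct {
	value    uint32
	ingester string
}

// Ring is a hypothetical, minimal consistent-hash ring (not the Cortex API).
type Ring struct {
	tokens []token // sorted by value
}

// hash32 hashes a string with 32-bit FNV-1a.
func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// AddIngester registers n tokens for an ingester. Tokens are derived from the
// name so the example is deterministic; real rings usually pick random tokens.
func (r *Ring) AddIngester(name string, n int) {
	for i := 0; i < n; i++ {
		r.tokens = append(r.tokens, token{hash32(name + "-" + strconv.Itoa(i)), name})
	}
	sort.Slice(r.tokens, func(i, j int) bool { return r.tokens[i].value < r.tokens[j].value })
}

// Get walks the ring clockwise from hash(key) and returns up to rf distinct
// ingesters, i.e. the replicas that should receive the series.
func (r *Ring) Get(key string, rf int) []string {
	if len(r.tokens) == 0 {
		return nil
	}
	h := hash32(key)
	start := sort.Search(len(r.tokens), func(i int) bool { return r.tokens[i].value >= h })
	seen := map[string]bool{}
	var out []string
	for i := 0; i < len(r.tokens) && len(out) < rf; i++ {
		t := r.tokens[(start+i)%len(r.tokens)] // wraps around past the last token
		if !seen[t.ingester] {
			seen[t.ingester] = true
			out = append(out, t.ingester)
		}
	}
	return out
}

func main() {
	r := &Ring{}
	for _, name := range []string{"ingester-1", "ingester-2", "ingester-3"} {
		r.AddIngester(name, 128)
	}
	// Hypothetical series key: tenant ID plus the series labels.
	fmt.Println(r.Get(`tenant-a/up{job="api"}`, 3))
}
```

Adding or removing one ingester only moves the keys next to its tokens, which is why a ring is preferred over plain `hash % n` sharding.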

13 of 40

The rise of the Cortex blocks storage

14 of 40

The rise of the Cortex blocks storage

Collaboration over Competition

15 of 40

Meanwhile on Thanos side...

[Diagram: Thanos in Kubernetes US1: blocks are uploaded to Object Storage; Querier (+deduplication), Store Gateway, Compactor]
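The "+deduplication" step merges series coming from HA Prometheus replicas that differ only by a replica label. Below is a simplified Go sketch, assuming a hypothetical `replica` label and exact-timestamp merging where the first replica wins. The real Thanos querier uses a more elaborate, penalty-based algorithm that switches between replicas when one of them has gaps, because replicas scrape at slightly different times; this sketch only illustrates the grouping idea, and `Dedup`, `Series` and `Sample` are illustrative names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Sample is a single data point; T is a Unix timestamp in milliseconds.
type Sample struct {
	T int64
	V float64
}

// Series is a label set plus samples sorted by T.
type Series struct {
	Labels  map[string]string
	Samples []Sample
}

// seriesKey builds a canonical key from all labels except replicaLabel.
// Illustrative only: it does not escape "=" or "," inside label values.
func seriesKey(lbls map[string]string, replicaLabel string) string {
	names := make([]string, 0, len(lbls))
	for n := range lbls {
		if n != replicaLabel {
			names = append(names, n)
		}
	}
	sort.Strings(names)
	var b strings.Builder
	for _, n := range names {
		b.WriteString(n + "=" + lbls[n] + ",")
	}
	return b.String()
}

// Dedup merges series that differ only by replicaLabel. On equal timestamps
// the sample from the first replica seen wins.
func Dedup(in []Series, replicaLabel string) []Series {
	type group struct {
		labels  map[string]string
		samples map[int64]float64
	}
	groups := map[string]*group{}
	var order []string // keeps output order stable
	for _, s := range in {
		k := seriesKey(s.Labels, replicaLabel)
		g, ok := groups[k]
		if !ok {
			lbls := map[string]string{}
			for n, v := range s.Labels {
				if n != replicaLabel {
					lbls[n] = v
				}
			}
			g = &group{labels: lbls, samples: map[int64]float64{}}
			groups[k] = g
			order = append(order, k)
		}
		for _, smp := range s.Samples {
			if _, exists := g.samples[smp.T]; !exists {
				g.samples[smp.T] = smp.V
			}
		}
	}
	out := make([]Series, 0, len(order))
	for _, k := range order {
		g := groups[k]
		s := Series{Labels: g.labels}
		for t, v := range g.samples {
			s.Samples = append(s.Samples, Sample{t, v})
		}
		sort.Slice(s.Samples, func(i, j int) bool { return s.Samples[i].T < s.Samples[j].T })
		out = append(out, s)
	}
	return out
}

func main() {
	in := []Series{
		{Labels: map[string]string{"job": "api", "replica": "A"}, Samples: []Sample{{0, 1}, {15000, 2}}},
		{Labels: map[string]string{"job": "api", "replica": "B"}, Samples: []Sample{{15000, 2}, {30000, 3}}},
	}
	for _, s := range Dedup(in, "replica") {
		fmt.Println(s.Labels, s.Samples)
	}
}
```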

16 of 40

Thanos Receiver: A little bit of Push

[Diagram: Thanos in Kubernetes US1 with Receiver: data is streamed to the Receiver; blocks in Object Storage; Querier (+deduplication), Store Gateway, Compactor]

17 of 40

Query Performance

18 of 40

Query Performance

19 of 40

Thanos Query Performance

Collaboration over Competition

20 of 40

Collaboration time!

Collaboration is a two-way street

21 of 40

Collaboration time!

Collaboration is tough sometimes

22 of 40

Collaboration time!

It’s very tempting to pick the shortest path

23 of 40

Collaboration time!

Sharing is learning: successes, failures, principles, practices, techniques

24 of 40

Collaboration time!

Collaboration pays off

25 of 40

Thanos bucket store and compactor

Cortex blocks storage is built on top of Thanos shipper, bucket store and compactor

[Diagram: Cortex blocks storage architecture: query-frontend, distributor, ingester, querier, store-gateway and compactor, each horizontally scaled (1..n); ingesters use the Thanos Shipper, store-gateways the Thanos Bucket Store, compactors the Thanos Compactor]

26 of 40

Cortex end-to-end testing framework

Easily write isolated end-to-end tests running in Docker

27 of 40

Cortex query-frontend

Query parallelization and results caching for any Prometheus API-compatible backend

28 of 40

Cortex caching strategies

Learning from Cortex caching strategies, we added memcached support to Thanos

[Diagram: Thanos Store with TSDB index cache, TSDB chunks cache and bucket metadata cache]
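The pattern behind these caches is cache-aside: look keys up in the cache, load only the misses from object storage, then backfill the cache. Below is a minimal Go sketch, assuming a hypothetical `Cache` interface with an in-memory stand-in where a memcached client would sit. The interface, `getThroughCache`, and the key format are illustrative, not the Thanos API.

```go
package main

import (
	"fmt"
	"sync"
)

// Cache is a hypothetical interface; a memcached-backed implementation
// would satisfy it just like the in-memory one below.
type Cache interface {
	Fetch(keys []string) (hits map[string][]byte, misses []string)
	Store(items map[string][]byte)
}

// memCache is an in-process stand-in for a shared cache such as memcached.
type memCache struct {
	mu   sync.Mutex
	data map[string][]byte
}

func newMemCache() *memCache { return &memCache{data: map[string][]byte{}} }

func (c *memCache) Fetch(keys []string) (map[string][]byte, []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	hits := map[string][]byte{}
	var misses []string
	for _, k := range keys {
		if v, ok := c.data[k]; ok {
			hits[k] = v
		} else {
			misses = append(misses, k)
		}
	}
	return hits, misses
}

func (c *memCache) Store(items map[string][]byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k, v := range items {
		c.data[k] = v
	}
}

// getThroughCache serves hits from the cache, loads only the misses from
// object storage via loadFromBucket, and backfills the cache with them.
func getThroughCache(c Cache, keys []string, loadFromBucket func([]string) map[string][]byte) map[string][]byte {
	hits, misses := c.Fetch(keys)
	if hits == nil {
		hits = map[string][]byte{}
	}
	if len(misses) > 0 {
		loaded := loadFromBucket(misses)
		c.Store(loaded)
		for k, v := range loaded {
			hits[k] = v
		}
	}
	return hits
}

func main() {
	c := newMemCache()
	bucketReads := 0
	// load stands in for reading index data from object storage.
	load := func(keys []string) map[string][]byte {
		bucketReads++
		out := make(map[string][]byte, len(keys))
		for _, k := range keys {
			out[k] = []byte("postings for " + k)
		}
		return out
	}
	keys := []string{"block-1:job=api", "block-1:env=prod"} // hypothetical key format
	getThroughCache(c, keys, load)
	getThroughCache(c, keys, load)                    // served entirely from the cache
	fmt.Println("object storage reads:", bucketReads) // prints "object storage reads: 1"
}
```

Keeping the backend behind an interface is what lets the same read path use an in-memory cache on a single node or a shared memcached cluster across many replicas.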

29 of 40

Thanos DNS-based service discovery

Easy to use DNS-based service discovery

  • dns+service.local:80: A/AAAA lookup
  • dnssrv+service.local: SRV lookup with A/AAAA resolution
  • dnssrvnoa+service.local: SRV lookup only

30 of 40

Cross project optimizations

Sharing optimization techniques and learnings

Thanos:

31 of 40

Cross project optimizations

Sharing optimization techniques and learnings

Prometheus:

32 of 40

More stress testing

More reuse = More ways to exercise the code!

33 of 40

More brain power

It’s not just about features!

More developers =

  • More ideas
  • Code revisited more often
  • More bugs fixed

34 of 40

Summary

  • Open Source is about people, collaboration and pushing boundaries together!

35 of 40

Summary

  • Open Source is about people, collaboration and pushing boundaries together!
  • Reuse is not easy, but in the end it is invaluable!

36 of 40

Future

“One does not change a winning team”

37 of 40

Thank You!

Any Questions?

  • We are on CNCF Slack: #cortex and #thanos channels 🤗
  • Marco: https://pracucci.com @pracucci
  • Bartek: https://bwplotka.dev @bwplotka

38 of 40

Agenda: Story of two projects

  • Started with opposite designs, pursuing similar goals (how it started)
  • Learning from each other along the way, amid initial competition
  • Competition kicks in: slowly reimplementing similar features, starting discussions about reuse instead of doing everything from scratch (blocks storage, query frontend, simpler config and usage)
    • How was the decision to reuse made? (time to market, we know the Thanos people)
  • Collaboration time! How to not kill each other and do it efficiently?
    • Collaboration on both sides!
      • Reuse by Cortex should not bring a downside to Thanos
      • The Thanos community has to accept changes that are not strictly beneficial to Thanos.
    • Technically:
  • Was it worth it? (real examples of collaboration?)
    • Shared features
    • Maintenance: more devs, revisiting code, fixing bugs
    • Experimenting is still OK, as long as it is backported
  • Future

39 of 40

In the beginning: Cortex

[Diagram: Cortex in Kubernetes US1: samples are streamed from the Distributor to the Ingester; chunks stored in Object Storage and the index in NoSQL; Query Frontend and Querier serve queries]

40 of 40

In the beginning: Thanos

[Diagram: Thanos in Kubernetes US1: blocks are uploaded to Object Storage; Querier (+deduplication), Store Gateway, Compactor]
