1 of 40

Lessons learned from operating ETCD

Pierre Zemb

Technical Leader OVHcloud

2 of 40

$ whoami


  • Pierre Zemb
  • Technical Leader
  • Working around distributed systems
    • Apache {HBase, Flink, Kafka, Pulsar}
    • FoundationDB, ETCD
  • Involved in local communities

3 of 40

$ whoami


  • Pierre Zemb
  • Senior Software Engineer
  • Working around distributed systems
    • Apache {HBase, Flink, Kafka, Pulsar}
    • FoundationDB, ETCD
  • Involved in local communities

4 of 40

Schedule

  • Introduction to ETCD
  • OVHcloud's managed Kubernetes and ETCD
  • Lessons learned from production
    • Observability
    • Configuration
    • Tips and tricks


5 of 40

OVHcloud: a global leader


Web Cloud & Telecom

Private Cloud

Public Cloud

Bare Metal Cloud

Network & Security

  • 31 data centers in 12 locations
  • 48 points of presence on a 20 Tbps bandwidth network
  • 2,450+ employees worldwide
  • 117K Private Cloud VMs running
  • 395K physical servers running in our data centers
  • 1 million+ servers produced since 1999
  • 340K Public Cloud instances running
  • 1.6 million customers across 132 countries
  • 1.5 billion euros invested since 2016
  • 20+ years in business, disrupting since 1999
  • P.U.E. 1.14 (energy efficiency indicator)
  • 6 million websites hosted

6 of 40

OVHcloud: a global leader


7 of 40

What is ETCD?

  • Strongly consistent, distributed key-value store
  • Stands for /etc in Linux, but in a distributed fashion
  • CNCF graduated project
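
For a quick feel of the API, a minimal etcdctl session could look like this (the key and value are just examples):

$ etcdctl put /config/feature-flag on
OK
$ etcdctl get /config/feature-flag
/config/feature-flag
on
# watch streams every subsequent change under a prefix
$ etcdctl watch /config/ --prefix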


8 of 40

ETCD in Kubernetes


9 of 40

ETCD in Kubernetes


10 of 40

ETCD in Kubernetes

  • ETCD holds the entire state of your Kubernetes clusters
  • ETCD has reactive capabilities thanks to watches
  • only the apiserver can talk to ETCD
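
As an illustration, Kubernetes stores its objects under the /registry prefix, so the same data can be inspected and watched with etcdctl (on a test cluster rather than a production control plane; the paths below assume default apiserver settings):

$ etcdctl get /registry/namespaces --prefix --keys-only
# the apiserver's reactive behaviour is built on the same watch mechanism
$ etcdctl watch /registry/pods/ --prefix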


11 of 40

Kubinception


12 of 40

Kubinception

[Diagram: Customer 1 control-plane, Customer 2 control-plane]

13 of 40

Kubinception


14 of 40

Kubinception implications for ETCD

That kind of workload puts a lot of 😓 on ETCD:

  • Dozens of ETCD clusters supporting each up to:
    • ~2k ranges/s
    • ~800 txn/s (1 txn every 1.103 ms)
    • ~1.6k msg/s sent through Watch


15 of 40

Lessons learned from operating ETCD


16 of 40

Observe ETCD


17 of 40

Enable observability


--metrics extensive \

--logger 'zap'
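
With those flags set, each member exposes Prometheus metrics over HTTP; a sketch of scraping them by hand, assuming an extra local metrics listener on port 2381:

$ etcd --metrics extensive --logger zap \
       --listen-metrics-urls http://127.0.0.1:2381
$ curl -s http://127.0.0.1:2381/metrics | grep etcd_server_has_leader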

18 of 40

What to observe on ETCD?

🔍🤔


19 of 40

4 layers to watch

  • gRPC for communication


20 of 40

4 layers to watch

  • gRPC for communication
  • Raft for consensus


21 of 40

4 layers to watch

  • gRPC for communication
  • Raft for consensus
  • a WAL (write-ahead log) for recovery and ordering of operations
  • bbolt to store keys and values


22 of 40

Observe ETCD: gRPC

Metric                                          Description
grpc_server_handled_total                       Total number of RPCs handled, per gRPC method
etcd_network_client_grpc_sent_bytes_total       Total number of bytes sent to gRPC clients
etcd_network_client_grpc_received_bytes_total   Total number of bytes received from gRPC clients
etcd_network_peer_round_trip_time_seconds       Round-trip-time histogram between peers
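
A rough way to spot failing RPCs from these counters, assuming a member reachable over plain HTTP on 127.0.0.1:2379, is to list the grpc_server_handled_total series whose grpc_code is not OK:

$ curl -s http://127.0.0.1:2379/metrics \
    | grep '^grpc_server_handled_total' \
    | grep -v 'grpc_code="OK"'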

23 of 40

Observe Raft: Leader and follower

  • Raft implies there is a Leader and Followers

Metric                                  Description
etcd_server_has_leader                  Whether or not a leader exists
etcd_server_leader_changes_seen_total   The number of leader changes seen

24 of 40

Observe Raft: Leader and follower

  • Raft implies there is a Leader and Followers
  • You can easily see which member is the Leader with etcdctl

$ etcdctl endpoint status -w table
+-------------------------+------------------+---------+---------+-----------+-----------+-------------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX  |
+-------------------------+------------------+---------+---------+-----------+-----------+-------------+
| https://10.0.0.101:2379 | e1819b08938c139e | 3.4.9   | 5.7 GB  | false     | 4095      | 26636440814 |
| https://10.0.0.102:2379 | 5ca1e82d1803044f | 3.4.9   | 5.7 GB  | false     | 4095      | 26636440817 |
| https://10.0.0.103:2379 | 2180ef520bcb43f1 | 3.4.9   | 5.7 GB  | true      | 4095      | 26636440819 |
+-------------------------+------------------+---------+---------+-----------+-----------+-------------+

25 of 40

Observe Raft: Leader and follower

  • The Leader sends heartbeats to Followers
  • Sometimes Followers take too long to respond
    • Generally because of slow disks or network

Metric                                      Description
etcd_server_heartbeat_send_failures_total   Total number of leader heartbeat send failures

26 of 40

Observe Raft: Proposals

  • Mutating the data is called a proposal

Metric                                  Description
etcd_server_proposals_pending           Number of proposals currently queued to commit
etcd_server_proposals_committed_total   Total number of consensus proposals committed
etcd_server_proposals_applied_total     Total number of consensus proposals applied
etcd_server_proposals_failed_total      Total number of failed proposals seen

27 of 40

Observe Raft: slow calls

  • ETCD has some tracing for >100ms calls


28 of 40

Observe Raft: slow calls

  • ETCD has some tracing for >100ms calls
  • and of course metrics 😁

Metric                                   Description
etcd_server_slow_apply_total             Total number of slow applies
etcd_server_slow_read_indexes_total      Total number of slow reads executed on a Follower
etcd_debugging_mvcc_slow_watcher_total   Total number of unsynced slow watchers

29 of 40

Observe ETCD: WALs

  • WAL stands for write-ahead log
  • Any mutation is written to the WAL before being processed
    • 🚀 SSD at the very least, NVMe is important here
    • 🚫 no network-attached volumes

Metric                                  Description
etcd_disk_wal_fsync_duration_seconds    Latency distribution of fsync calls made by the WAL
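
Before blaming etcd, it is worth measuring the disk itself; a commonly used fio check that approximates the WAL's small synchronous writes looks like this (the target directory is an example), and the reported fdatasync percentiles should stay in the low milliseconds:

$ fio --rw=write --ioengine=sync --fdatasync=1 \
      --directory=/var/lib/etcd-disk-test --size=22m --bs=2300 \
      --name=etcd-wal-check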

30 of 40

Observe ETCD: WALs


31 of 40

Operating ETCD: storage size

Metric                             Description
etcd_server_quota_backend_bytes    Current backend storage quota size in bytes
etcd_mvcc_db_total_size_in_bytes   Current backend storage size in bytes
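
etcd_server_quota_backend_bytes reflects the --quota-backend-bytes setting; the default quota is 2 GB and it can be raised at startup, 8 GB being the commonly recommended upper bound:

# 8 GB backend quota
$ etcd --quota-backend-bytes=8589934592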

32 of 40

Tips and tricks on ETCD


33 of 40

Operating ETCD: compaction

  • Since etcd keeps an exact history of its keyspace, this history should be periodically compacted
  • Can be automated or controlled by the client


# Available values: periodic or revision
--auto-compaction-mode periodic

# Number of revisions, or time in hours, before compacting
--auto-compaction-retention 1
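
For the client-controlled variant, a sketch with etcdctl: read the current revision from the endpoint status and compact everything before it (following the pattern from etcd's maintenance documentation):

$ rev=$(etcdctl endpoint status --write-out=json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
$ etcdctl compaction "$rev"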

34 of 40

Operating ETCD: storage size

  • bbolt uses a memory-mapped file so the underlying operating system handles the caching of the data.

⚠️ Do not cap ETCD’s memory below max data limit ⚠️


35 of 40

Operating ETCD: defrag

  • After compacting the keyspace, bbolt may exhibit internal fragmentation.


# run locally

etcdctl defrag
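
Defragmentation runs per member and blocks that member while it rewrites the database file, so a reasonable pattern is to loop over the endpoints one at a time (the endpoints reuse the example cluster from the status table above):

for ep in https://10.0.0.101:2379 https://10.0.0.102:2379 https://10.0.0.103:2379; do
  etcdctl --endpoints="$ep" defrag
done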

36 of 40

Operating ETCD: backups

  • The embedded tool is good, you can use it 😎
    • based on the Snapshot feature in Raft
    • Snapshots are local
      • it is best to trigger it on a follower


etcdctl snapshot save
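
A sketch of a backup run, pointing etcdctl at a Follower and then sanity-checking the resulting file (endpoint and path are examples):

$ etcdctl --endpoints=https://10.0.0.101:2379 snapshot save /backups/etcd-snapshot.db
$ etcdctl snapshot status /backups/etcd-snapshot.db -w table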

37 of 40

Operating ETCD: cluster operation

  • In order to replace a failed machine, you need to:
    • remove the faulty member with etcdctl
    • add a new member with etcdctl

👉 Automate this process 👈
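
A sketch of the manual sequence that such automation wraps (the member ID and URLs are placeholders; the replacement member must then be started with --initial-cluster-state existing):

$ etcdctl member list -w table
$ etcdctl member remove e1819b08938c139e
$ etcdctl member add etcd-replacement --peer-urls=https://10.0.0.104:2380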


38 of 40

Using ETCD: client configuration

Enable DialKeepAliveTime and DialKeepAliveTimeout on ETCD's clients
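
DialKeepAliveTime and DialKeepAliveTimeout are options of etcd's Go client configuration (clientv3.Config); etcdctl exposes the same idea through its --keepalive-time and --keepalive-timeout flags, so a command-line equivalent (with illustrative values) looks like:

$ etcdctl --endpoints=https://10.0.0.101:2379 \
          --keepalive-time=2s --keepalive-timeout=6s \
          endpoint health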


39 of 40

TL;DR

  • Observe everything, from gRPC to I/O latencies
  • Triple check disks and network latencies
  • Raise storage size limit to 8GB
  • If you need more than 8GB, --etcd-servers-overrides=/events is your friend 😁 (see the sketch after this list)
  • Dedicated machines with SSD at the very least, or NVMe
  • Don't be stingy on RAM limits (at least twice the storage size limit)
  • Compact and defrag members
  • Run snapshots on Followers
  • Automate membership alterations to ease on-call duty
  • Enable DialKeepAliveTime and DialKeepAliveTimeout on ETCD's clients
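
For the --etcd-servers-overrides line above, a sketch of how the kube-apiserver flag is typically used to push the high-churn events resource onto its own etcd cluster (hostnames are made up; overrides use the group/resource#server-list format, with servers separated by semicolons):

kube-apiserver \
  --etcd-servers=https://etcd-main-1:2379,https://etcd-main-2:2379,https://etcd-main-3:2379 \
  --etcd-servers-overrides=/events#https://etcd-events-1:2379;https://etcd-events-2:2379;https://etcd-events-3:2379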


40 of 40


  • Slides: https://pierrezemb.fr
  • Twitter: PierreZ
  • Github: PierreZ

Thanks!

Do you have any questions?

Visit OVHcloud's virtual booth!