1 of 40

Lessons learned from operating ETCD

Pierre Zemb

Technical Leader OVHcloud

2 of 40

$ whoami


  • Pierre Zemb
  • Technical Leader
  • Working around distributed systems
    • Apache {HBase, Flink, Kafka, Pulsar}
    • FoundationDB, ETCD
  • Involved in local communities

3 of 40

$ whoami


  • Pierre Zemb
  • Senior Software Engineer
  • Working around distributed systems
    • Apache {HBase, Flink, Kafka, Pulsar}
    • FoundationDB, ETCD
  • Involved in local communities

4 of 40

Schedule

  • Introduction to ETCD
  • OVHcloud's managed Kubernetes and ETCD
  • Lessons learned from production
    • Observability
    • Configuration
    • Tips and tricks


5 of 40

OVHcloud: a global leader


Web Cloud & Telecom

Private Cloud

Public Cloud

Bare Metal Cloud

Network & Security

  • 31 data centers in 12 locations
  • 48 points of presence on a 20 Tbps bandwidth network
  • 2,450+ employees worldwide
  • 117K Private Cloud VMs running
  • 395K physical servers running in our data centers
  • 1 million+ servers produced since 1999
  • 340K Public Cloud instances running
  • 1.6 million customers across 132 countries
  • 1.5 billion euros invested since 2016
  • 20+ years in business, disrupting since 1999
  • P.U.E. 1.14 (energy efficiency indicator)
  • 6 million websites hosted

6 of 40

OVHcloud: a global leader


7 of 40

What is ETCD?

  • Strongly consistent, distributed key-value store
  • Stands for /etc in Linux, but in a distributed fashion
  • CNCF graduated project
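
For a quick feel of the API, a minimal etcdctl session could look like this (the key and value are just examples):

$ etcdctl put /config/feature-flag on
OK
$ etcdctl get /config/feature-flag
/config/feature-flag
on
# watch streams every subsequent change under a prefix
$ etcdctl watch /config/ --prefix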


8 of 40

ETCD in Kubernetes


9 of 40

ETCD in Kubernetes


10 of 40

ETCD in Kubernetes

  • ETCD holds the entire state of your Kubernetes clusters
  • ETCD has reactive capabilities thanks to watches
  • only the apiserver can talk to ETCD
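
As an illustration, Kubernetes stores its objects under the /registry prefix, so the same data can be inspected and watched with etcdctl (on a test cluster rather than a production control plane; the paths below assume default apiserver settings):

$ etcdctl get /registry/namespaces --prefix --keys-only
# the apiserver's reactive behaviour is built on the same watch mechanism
$ etcdctl watch /registry/pods/ --prefix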


11 of 40

Kubinception


12 of 40

Kubinception

[Diagram: Customer 1 control-plane, Customer 2 control-plane]

13 of 40

Kubinception


14 of 40

Kubinception implications for ETCD

That kind of workload puts a lot of 😓 on ETCD:

  • Dozens of ETCD clusters supporting each up to:
    • ~2k ranges/s
    • ~800 txn/s (1 txn every 1.103 ms)
    • ~1.6k msg/s sent through Watch


15 of 40

Lessons learned from operating ETCD


16 of 40

Observe ETCD


17 of 40

Enable observability


--metrics extensive \

--logger 'zap'
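
With those flags set, each member exposes Prometheus metrics over HTTP; a sketch of scraping them by hand, assuming an extra local metrics listener on port 2381:

$ etcd --metrics extensive --logger zap \
       --listen-metrics-urls http://127.0.0.1:2381
$ curl -s http://127.0.0.1:2381/metrics | grep etcd_server_has_leader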

18 of 40

What to observe on ETCD?

🔍🤔


19 of 40

4 layers to watch

  • gRPC for communication


20 of 40

4 layers to watch

  • gRPC for communication
  • Raft for consensus


21 of 40

4 layers to watch

  • gRPC for communication
  • Raft for consensus
  • a WAL (write-ahead log) for recovery and ordering of operations
  • bbolt to store keys and values


22 of 40

Observe ETCD: gRPC

Metric                                          Description
grpc_server_handled_total                       Total number of RPCs handled, per gRPC method
etcd_network_client_grpc_sent_bytes_total       Total number of bytes sent to gRPC clients
etcd_network_client_grpc_received_bytes_total   Total number of bytes received from gRPC clients
etcd_network_peer_round_trip_time_seconds       Round-trip-time histogram between peers
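
A rough way to spot failing RPCs from these counters, assuming a member reachable over plain HTTP on 127.0.0.1:2379, is to list the grpc_server_handled_total series whose grpc_code is not OK:

$ curl -s http://127.0.0.1:2379/metrics \
    | grep '^grpc_server_handled_total' \
    | grep -v 'grpc_code="OK"'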

23 of 40

Observe Raft: Leader and follower

  • Raft implies there is a Leader and Followers

Metric                                  Description
etcd_server_has_leader                  Whether or not a leader exists
etcd_server_leader_changes_seen_total   The number of leader changes seen

24 of 40

Observe Raft: Leader and follower

  • Raft implies there is a Leader and Followers
  • You can easily see which member is the Leader with etcdctl

$ etcdctl endpoint status -w table
+-------------------------+------------------+---------+---------+-----------+-----------+-------------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX  |
+-------------------------+------------------+---------+---------+-----------+-----------+-------------+
| https://10.0.0.101:2379 | e1819b08938c139e | 3.4.9   | 5.7 GB  | false     | 4095      | 26636440814 |
| https://10.0.0.102:2379 | 5ca1e82d1803044f | 3.4.9   | 5.7 GB  | false     | 4095      | 26636440817 |
| https://10.0.0.103:2379 | 2180ef520bcb43f1 | 3.4.9   | 5.7 GB  | true      | 4095      | 26636440819 |
+-------------------------+------------------+---------+---------+-----------+-----------+-------------+

25 of 40

Observe Raft: Leader and follower

  • The Leader sends heartbeats to Followers
  • Sometimes Followers take too long to respond
    • Generally because of slow disks or network

Metric                                      Description
etcd_server_heartbeat_send_failures_total   Total number of leader heartbeat send failures

26 of 40

Observe Raft: Proposals

  • Mutating the data is called a proposal

Metric                                  Description
etcd_server_proposals_pending           Number of proposals currently queued to commit
etcd_server_proposals_committed_total   Total number of consensus proposals committed
etcd_server_proposals_applied_total     Total number of consensus proposals applied
etcd_server_proposals_failed_total      Total number of failed proposals seen

27 of 40

Observe Raft: slow calls

  • ETCD has some tracing for >100ms calls


28 of 40

Observe Raft: slow calls

  • ETCD has some tracing for >100ms calls
  • and of course metrics 😁

Metric                                   Description
etcd_server_slow_apply_total             Total number of slow applies
etcd_server_slow_read_indexes_total      Total number of slow reads executed on a Follower
etcd_debugging_mvcc_slow_watcher_total   Total number of unsynced slow watchers

29 of 40

Observe ETCD: WALs

  • WAL stands for write-ahead log
  • Any mutation is written to the WAL before being processed
    • 🚀 SSD at the very least, NVMe is important here
    • 🚫 no network-attached volumes

Metric                                  Description
etcd_disk_wal_fsync_duration_seconds    Latency distribution of fsync calls made by the WAL
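
Before blaming etcd, it is worth measuring the disk itself; a commonly used fio check that approximates the WAL's small synchronous writes looks like this (the target directory is an example), and the reported fdatasync percentiles should stay in the low milliseconds:

$ fio --rw=write --ioengine=sync --fdatasync=1 \
      --directory=/var/lib/etcd-disk-test --size=22m --bs=2300 \
      --name=etcd-wal-check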

30 of 40

Observe ETCD: WALs


31 of 40

Operating ETCD: storage size

Metric                             Description
etcd_server_quota_backend_bytes    Current backend storage quota size in bytes
etcd_mvcc_db_total_size_in_bytes   Current backend storage size in bytes
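
etcd_server_quota_backend_bytes reflects the --quota-backend-bytes setting; the default quota is 2 GB and it can be raised at startup, 8 GB being the commonly recommended upper bound:

# 8 GB backend quota
$ etcd --quota-backend-bytes=8589934592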

32 of 40

Tips and tricks on ETCD


33 of 40

Operating ETCD: compaction

  • Since etcd keeps an exact history of its keyspace, this history should be periodically compacted
  • Can be automated or controlled by the client


# Available values: periodic or revision
--auto-compaction-mode periodic

# Number of revisions, or time in hours, before compacting
--auto-compaction-retention 1
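
For the client-controlled variant, a sketch with etcdctl: read the current revision from the endpoint status and compact everything before it (following the pattern from etcd's maintenance documentation):

$ rev=$(etcdctl endpoint status --write-out=json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
$ etcdctl compaction "$rev"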

34 of 40

Operating ETCD: storage size

  • bbolt uses a memory-mapped file so the underlying operating system handles the caching of the data.

⚠️ Do not cap ETCD’s memory below max data limit ⚠️


35 of 40

Operating ETCD: defrag

  • After compacting the keyspace, bbolt may exhibit internal fragmentation.


# run locally

etcdctl defrag
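
Defragmentation runs per member and blocks that member while it rewrites the database file, so a reasonable pattern is to loop over the endpoints one at a time (the endpoints reuse the example cluster from the status table above):

for ep in https://10.0.0.101:2379 https://10.0.0.102:2379 https://10.0.0.103:2379; do
  etcdctl --endpoints="$ep" defrag
done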

36 of 40

Operating ETCD: backups

  • The embedded tool is good, you can use it 😎
    • based on the Snapshot feature in Raft
    • Snapshots are local
      • it is best to trigger it on a follower


etcdctl snapshot save
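
A sketch of a backup run, pointing etcdctl at a Follower and then sanity-checking the resulting file (endpoint and path are examples):

$ etcdctl --endpoints=https://10.0.0.101:2379 snapshot save /backups/etcd-snapshot.db
$ etcdctl snapshot status /backups/etcd-snapshot.db -w table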

37 of 40

Operating ETCD: cluster operation

  • In order to replace a failed machine, you need to:
    • remove the faulty member with etcdctl
    • add a new member with etcdctl

👉 Automate this process 👈
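
A sketch of the manual sequence that such automation wraps (the member ID and URLs are placeholders; the replacement member must then be started with --initial-cluster-state existing):

$ etcdctl member list -w table
$ etcdctl member remove e1819b08938c139e
$ etcdctl member add etcd-replacement --peer-urls=https://10.0.0.104:2380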


38 of 40

Using ETCD: client configuration

Enable DialKeepAliveTime and DialKeepAliveTimeout on ETCD's clients
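
DialKeepAliveTime and DialKeepAliveTimeout are options of etcd's Go client configuration (clientv3.Config); etcdctl exposes the same idea through its --keepalive-time and --keepalive-timeout flags, so a command-line equivalent (with illustrative values) looks like:

$ etcdctl --endpoints=https://10.0.0.101:2379 \
          --keepalive-time=2s --keepalive-timeout=6s \
          endpoint health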


39 of 40

TL;DR

  • Observe everything, from gRPC to I/O latencies
  • Triple check disks and network latencies
  • Raise storage size limit to 8GB
  • If you need more than 8GB, --etcd-servers-overrides=/events is your friend 😁 (see the sketch after this list)
  • Dedicated machines with SSD at the very least, or NVMe
  • Don't be stingy on RAM limits (at least twice the storage size limit)
  • Compact and defrag members
  • Run snapshots on Followers
  • Automate membership alterations to ease on-call duty
  • Enable DialKeepAliveTime and DialKeepAliveTimeout on ETCD's clients
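
For the --etcd-servers-overrides line above, a sketch of how the kube-apiserver flag is typically used to push the high-churn events resource onto its own etcd cluster (hostnames are made up; overrides use the group/resource#server-list format, with servers separated by semicolons):

kube-apiserver \
  --etcd-servers=https://etcd-main-1:2379,https://etcd-main-2:2379,https://etcd-main-3:2379 \
  --etcd-servers-overrides=/events#https://etcd-events-1:2379;https://etcd-events-2:2379;https://etcd-events-3:2379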


40 of 40


  • Slides: https://pierrezemb.fr
  • Twitter: PierreZ
  • Github: PierreZ

Thanks!

Do you have any questions?

Visit OVHcloud's virtual booth!