Lessons learned
from operating ETCD
Pierre Zemb
Technical Leader OVHcloud
$ whoami
Schedule
OVHcloud: a global leader
Web Cloud & Telecom
Private Cloud
Public Cloud
Bare Metal Cloud
Network & Security
31 Data Centers in 12 locations
48 Points of Presence on a 20 Tbps Bandwidth Network
2450+ Employees worldwide
117K Private Cloud VMs running
395K Physical Servers running in our data centers
1 Million+ Servers produced since 1999
340K Public Cloud instances running
1.6 Million Customers across 132 countries
1.5 Billion Euros Invested since 2016
20+ Years in Business, disrupting since 1999
P.U.E. 1.14 Energy efficiency indicator
6 Million Websites hosted
What is ETCD?
ETCD in Kubernetes
Kubinception
Customer 1
control-plane
Customer 2
control-plane
Kubinception implications for ETCD
This kind of workload puts a lot of 😓 on ETCD.
Observe
Enable observability
--metrics extensive \
--logger 'zap'
What to observe on ETCD?
🔍🤔
4 layers to watch
gRPC
Raft engine
WAL
bbolt
Observe ETCD: gRPC
Metrics name | Helper |
grpc_server_handled_total | Total number of RPCs per gRPC method |
etcd_network_client_grpc_sent_bytes_total | The total number of bytes sent |
etcd_network_client_grpc_received_bytes_total | The total number of bytes received |
etcd_network_peer_round_trip_time_seconds | Round-Trip-Time histogram between peers |
Observe Raft: Leader and follower
Metrics name | Helper |
etcd_server_has_leader | Whether or not a leader exists. |
etcd_server_leader_changes_seen_total | The number of leader changes seen. |
$ etcdctl endpoint status -w table
+-------------------------+------------------+---------+---------+-----------+-----------+-------------+
| ENDPOINT                | ID               | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX  |
+-------------------------+------------------+---------+---------+-----------+-----------+-------------+
| https://10.0.0.101:2379 | e1819b08938c139e | 3.4.9   | 5.7 GB  | false     | 4095      | 26636440814 |
| https://10.0.0.102:2379 | 5ca1e82d1803044f | 3.4.9   | 5.7 GB  | false     | 4095      | 26636440817 |
| https://10.0.0.103:2379 | 2180ef520bcb43f1 | 3.4.9   | 5.7 GB  | true      | 4095      | 26636440819 |
+-------------------------+------------------+---------+---------+-----------+-----------+-------------+
Metrics name | Helper |
etcd_server_heartbeat_send_failures_total | The total number of leader heartbeat send failures |
Observe Raft: Proposals
Metrics name | Helper |
etcd_server_proposals_pending | The current number of pending proposals queued to commit. |
etcd_server_proposals_committed_total | The total number of consensus proposals committed. |
etcd_server_proposals_applied_total | The total number of consensus proposals applied. |
etcd_server_proposals_failed_total | The total number of failed proposals seen. |
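One way to read the committed and applied counters together is to look at the gap between them: a gap that keeps growing suggests the member is struggling to apply what the cluster has already agreed on. The sketch below assumes the metrics arrive on stdin in the Prometheus text format; the function name `etcd_apply_lag` is hypothetical.

```shell
# Print the number of proposals committed but not yet applied.
# Input: Prometheus text-format metrics on stdin.
# etcd_apply_lag is a hypothetical helper name, not part of etcd.
etcd_apply_lag() {
  awk '
    # "# HELP" and "# TYPE" lines have "#" as first field, so they never match.
    $1 == "etcd_server_proposals_committed_total" { committed = $2 + 0 }
    $1 == "etcd_server_proposals_applied_total"   { applied = $2 + 0 }
    END { printf "%d\n", committed - applied }
  '
}
```

It could be fed from a member's metrics endpoint, for example `curl -s http://127.0.0.1:2379/metrics | etcd_apply_lag`, where the URL is only an assumption about the local setup.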
Observe Raft: slow calls
Metrics name | Helper |
etcd_server_slow_apply_total | Total number of slowed applies |
etcd_server_slow_read_indexes_total | Total number of slow read executed on a Follower |
etcd_debugging_mvcc_slow_watcher_total | Total number of unsynced slow watchers. |
Observe ETCD: WALs
Metrics name | Helper |
etcd_disk_wal_fsync_duration_seconds | The latency distributions of fsync called by wal |
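Because this metric is a histogram, a rough health check is the share of fsync calls that finished under a given bucket boundary. The sketch below assumes Prometheus text-format metrics on stdin and a bucket boundary of `0.008` seconds; both the boundary and the function name `etcd_wal_fsync_fast_ratio` are assumptions to adapt to what the member actually exposes.

```shell
# Print the percentage of WAL fsync calls at or below a bucket boundary.
# $1: the "le" bucket boundary in seconds (assumed default: 0.008).
# etcd_wal_fsync_fast_ratio is a hypothetical helper name.
etcd_wal_fsync_fast_ratio() {
  le="${1:-0.008}"
  awk -v bucket="etcd_disk_wal_fsync_duration_seconds_bucket{le=\"$le\"}" '
    $1 == bucket { fast = $2 + 0 }
    $1 == "etcd_disk_wal_fsync_duration_seconds_count" { total = $2 + 0 }
    END {
      if (total > 0) printf "%.1f\n", 100 * fast / total
      else exit 1   # no fsync observed, nothing to report
    }
  '
}
```

A value that drops over time points at a disk that is getting slower, which is worth checking before looking at the Raft layer.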
Operating ETCD: storage size
Metrics name | Helper |
etcd_server_quota_backend_bytes | Current backend storage quota size in bytes. |
etcd_mvcc_db_total_size_in_bytes | Current backend storage size in bytes. |
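Dividing the second gauge by the first shows how close the backend is to its quota, which is useful to alert on before writes start being rejected. The sketch below assumes Prometheus text-format metrics on stdin; the function name `etcd_quota_usage` is hypothetical.

```shell
# Print the backend size as a percentage of the storage quota.
# Input: Prometheus text-format metrics on stdin.
# etcd_quota_usage is a hypothetical helper name, not part of etcd.
etcd_quota_usage() {
  awk '
    $1 == "etcd_server_quota_backend_bytes"  { quota = $2 + 0 }
    $1 == "etcd_mvcc_db_total_size_in_bytes" { size = $2 + 0 }
    END {
      if (quota > 0) printf "%.1f\n", 100 * size / quota
      else exit 1   # quota metric missing from the input
    }
  '
}
```

The values may be printed in scientific notation (for example `8e+09`); adding `0` in awk converts them to numbers before the division.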
Tips and tricks on ETCD
Operating ETCD: compaction
# --auto-compaction-retention: number of revisions, or time in hours, to keep before compacting
# --auto-compaction-mode: periodic or revision
--auto-compaction-retention 1 \
--auto-compaction-mode periodic
Operating ETCD: storage size
⚠️ Do not cap ETCD’s memory below its maximum data size ⚠️
Operating ETCD: defrag
# run locally
etcdctl defrag
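Defragmentation blocks the member while it rebuilds its database, so a common approach is to go through the members one at a time and stop at the first failure. This is a minimal sketch of that loop, not a complete procedure: the function name `defrag_members` and the `ETCD_CLI` override variable are hypothetical, and TLS flags are left out.

```shell
# Defragment the given endpoints sequentially, stopping on the first error.
# ETCD_CLI is a hypothetical override for the etcdctl binary (handy for dry runs).
defrag_members() {
  for ep in "$@"; do
    "${ETCD_CLI:-etcdctl}" --endpoints="$ep" defrag || return 1
  done
}
```

Running it with `ETCD_CLI=echo` prints the commands instead of executing them, which is a cheap way to check the endpoint list first.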
Operating ETCD: backups
# snapshot.db is an example file name
etcdctl snapshot save snapshot.db
Operating ETCD: cluster operation
👉 Automate this process 👈
Using ETCD: client configuration
Set DialKeepAliveTime and DialKeepAliveTimeout on ETCD's clients so that dead connections are detected
TL;DR
Slides
GitHub
Thanks!
Do you have some questions?
Visit OVHcloud's
virtual booth!