1
Nezha: Deployable and High-Performance Consensus Using Accurately Synchronized Clocks
Jinkun Geng*, Anirudh Sivaraman+, Balaji Prabhakar*, and Mendel Rosenblum*
*Stanford University, +New York University
VLDB 2023
Project open-sourced at https://github.com/Steamgjk/Nezha
Crash-Fault-Tolerant (CFT) Consensus
2
Storage/DBs: BigTable, TiKV, CockroachDB, MongoDB, RedisRaft, Delos, TiDB, Spanner
Coordination: Chubby, Zookeeper, etcd
Messaging: RabbitMQ, Apache Kafka
DNS Servers: CoreDNS
Container orchestration tools: Docker Swarm, Kubernetes …
1980s: Paxos, Viewstamped Replication
2003: Fast Paxos
2004: Generalized Paxos
2008: Mencius
2013: EPaxos
2014: Raft
2015: Speculative Paxos, NetPaxos
2016: NOPaxos
2017: APUS
2018: NetChain
2020: Mu
3
Yet, consensus is often still a bottleneck
Why consensus is still slow: single-leader protocols
🡪 Single-leader protocols, e.g., Raft and Multi-Paxos.
4
[Figure: Raft/Multi-Paxos: clients send requests to a single leader, which replicates them to the followers; the leader is the bottleneck.]
Fixing single-leader bottlenecks through optimistic protocols
5
[Figure: NOPaxos: clients, leader, and followers; requests go to all replicas rather than through the leader alone.]
Combating message reordering in the network
6
[Figure: sender S multicasts messages A and B to receivers R1 and R2 over different paths.]
Multiple non-equal-delay paths between S and R1/R2 =>
messages arrive in different orders at different receivers
Combating message reordering in the network
7
Multiple non-equal-delay paths between S and R1/R2 =>
messages arrive in different orders at different receivers
[Figure: sender S, receivers R1 and R2, messages A and B, as on the previous slide.]
Fix: force S->R1 and S->R2 to have the same delay (e.g., SpecPaxos, NetPaxos, NOPaxos) 🡪 requires control over the network (e.g., routing)
Our work: helping cloud tenants who can’t control routing
8
9
Deadline-Ordered Multicast (DOM)
[Figure: sender S multicasts to receivers R1-Rn; every machine shows the same synchronized clock (ticks 0000-0009) and each message carries "Deadline: 0009"; receivers hold the message until their local clocks reach the deadline.]
Senders and receivers synchronized to the same clock
Sender attaches a deadline to the message
Sender multicasts the message to all receivers
Messages arrive at unpredictable times
Receiver processes a message when its deadline is reached on the receiver clock, and processes multiple messages in deadline order
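To make the primitive concrete, here is a minimal single-process sketch of a DOM receiver (illustrative Python, not Nezha's C++ implementation; `Msg`, `DomReceiver`, and the injected `clock` callable are assumed names): early messages are held in a buffer ordered by deadline and released once the local, synchronized clock reaches the deadline, while too-late messages are set aside.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass(order=True)
class Msg:
    deadline: float                       # sender-assigned deadline (synchronized-clock time)
    payload: str = field(compare=False)   # excluded from ordering

class DomReceiver:
    """Hold early messages and release them in deadline order (a sketch, not Nezha's code)."""
    def __init__(self, clock: Callable[[], float]):
        self.clock = clock                # assumed to be an accurately synchronized clock
        self.early: List[Msg] = []        # min-heap keyed by deadline ("early buffer")
        self.late: List[Msg] = []         # messages that missed their deadline ("late buffer")

    def on_arrival(self, msg: Msg) -> None:
        # A message is "early" if it arrives before its deadline on the local clock.
        if self.clock() <= msg.deadline:
            heapq.heappush(self.early, msg)
        else:
            self.late.append(msg)         # DOM is best effort: too-late messages are set aside

    def release_due(self) -> List[Msg]:
        # Deliver, in deadline order, every held message whose deadline has been reached.
        out: List[Msg] = []
        while self.early and self.early[0].deadline <= self.clock():
            out.append(heapq.heappop(self.early))
        return out
```

Because every receiver releases a held message at the same synchronized-clock deadline, any two receivers that both get the message on time deliver it in the same relative order.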
10
DOM deadline estimation
Defining a reordering score to evaluate DOM effectiveness
11
[Figure: delivery sequences of messages 1-4 at receivers R1 and R2 over time (two scenarios), used to illustrate the reordering score.]
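The deck defines the score only by picture, so as one concrete (assumed) way to quantify reordering between two receivers, the sketch below counts the fraction of common message pairs delivered in opposite relative order; this is an illustrative metric, not necessarily the exact formula used in the paper.

```python
from itertools import combinations
from typing import Hashable, Sequence

def pairwise_reordering(seq_a: Sequence[Hashable], seq_b: Sequence[Hashable]) -> float:
    """Fraction of common message pairs delivered in opposite relative order (illustrative metric)."""
    pos_a = {m: i for i, m in enumerate(seq_a)}
    pos_b = {m: i for i, m in enumerate(seq_b)}
    common = [m for m in seq_a if m in pos_b]
    if len(common) < 2:
        return 0.0
    flipped = sum(
        1 for x, y in combinations(common, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return flipped / (len(common) * (len(common) - 1) / 2)

# The slide's example: R1 and R2 both deliver 1,2,3,4 -> no reordering;
# if R2 instead delivers 2,1,4,3, two of the six pairs are flipped.
print(pairwise_reordering([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0
print(pairwise_reordering([1, 2, 3, 4], [2, 1, 4, 3]))  # 0.333...
```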
DOM reduces message reordering in cloud (Google Cloud)
12
2 Receivers + 10 Senders multicasting at 10K requests/sec each
Larger percentile 🡪 Larger deadline
Takeaway-1: Message reordering indeed occurs frequently in the public cloud
🡪 Much room for performance improvement of consensus protocols
Takeaway-2: DOM is a best-effort primitive, i.e., DOM reduces but does not eliminate reordering
🡪 Our protocol should also handle inconsistency in the slow path
Nezha fast path: All messages arrive before their deadlines
13
[Figure: timeline: Client-1's request (ddl=10) and Client-2's request (ddl=12) reach the Leader, Follower-1, and Follower-2 before their deadlines; both requests are committed, at t=10 and t=12.]
Nezha slow path
14
[Figure: timeline: some copies of the requests (ddl=10, ddl=12) arrive after their deadlines ("Too late!"), so they are not committed on the fast path; the slow path fixes the inconsistency, after which the requests are committed (timeline marks at t=10, 12, 13).]
Benefit of the fast-path/slow-path structure: use synchronized clocks for acceleration, not for correctness
Evaluation
15
16
Comparison to memory-based protocols (log is in memory)
17
Comparison to disk-based protocols
Conclusion
18
19
Comparison for diskless protocols (final copy in memory)
* NOPaxos with a software sequencer, because programmable switches are not available in the public cloud.
20
Nezha vs. Raft (data persisted to disk)
[Figure: latency (ms) vs. throughput (x1K reqs/sec) for Nezha and Raft, closed-loop and open-loop tests; annotated throughputs of 4.5K, 30.7K, 78.9K, 29.7K, and 87.1K reqs/sec.]
21
Comparison for diskless protocols (final copy in memory)
[Figure: latency (μs) vs. throughput (x1K reqs/sec) for the diskless protocols, closed-loop and open-loop tests; annotated throughputs of 9.7K, 25K, 69K, 109K, 203K, 12K, 22K, 80K, and 191K reqs/sec.]
* NOPaxos with a software sequencer, because programmable switches are not available in the public cloud.
22
Yet, consensus is often still a bottleneck (Pulsar as an Example)
[Figure: Pulsar architecture: Producers and Consumers connect to Brokers (with a Dispatcher); Brokers store the data (topic-based messages) on BookKeeper Bookies through a BK Client; Local ZooKeeper¹ and Global ZooKeeper (via a Global Replicator) persist the metadata (e.g., the topic partition info and the message offsets on bookies) and are the bottleneck. Figure adapted from https://pulsar.apache.org/docs/next/administration-zk-bk/]
1. The latest version of Pulsar also supports other consensus modules, including etcd (i.e., a Raft implementation).
Fast path
23
① DOM-S tags the deadline to the request
② DOM-S multicasts the request to all replicas, and it is accepted by the Early Buffer
③ DOM-R releases the request according to its deadline and appends it to the log
④ The leader executes the request
⑤ Replicas compute a hash and reply to the client for the quorum check
(A code sketch of steps ①-⑤ follows the figure below.)
[Figure: fast path: requests from clients go to the Proxy (DOM-S, ①), which multicasts them to the Leader and Followers (②); at each replica, DOM-R admits on-time requests into the Early Buffer (late ones go to the Late Buffer) and releases them into the Log in deadline order (③); the Leader executes requests on its State Machine (④); Leader and Followers reply to the Proxy, which performs the Quorum Check and replies to clients (⑤).]
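Here is a rough Python sketch of steps ①-⑤ (assumed names and simplified data structures; the real logic lives in the open-source C++ repo and differs in detail): a replica admits a request only if its deadline has not yet passed, keeps its log in deadline order, and replies with a hash of the log; the proxy's quorum check compares those hashes. The `quorum_size` parameter is a placeholder; the paper specifies the exact fast-quorum condition.

```python
import hashlib
from typing import Dict, List, Tuple

def log_hash(log: List[Tuple[float, str]]) -> str:
    """Hash over the (deadline, request) log; replicas reply with this for the quorum check."""
    h = hashlib.sha256()
    for deadline, req in log:
        h.update(f"{deadline}:{req};".encode())
    return h.hexdigest()

def replica_fast_path(log: List[Tuple[float, str]], deadline: float, req: str,
                      now: float) -> Tuple[bool, str]:
    """Steps ②/③/⑤ sketch: admit on-time requests and keep the log in deadline order."""
    accepted = now <= deadline            # early buffer admits only if the deadline has not passed
    if accepted:
        log.append((deadline, req))
        log.sort()                        # DOM-R releases requests in deadline order (simplified)
    return accepted, log_hash(log)

def proxy_quorum_check(replies: Dict[str, str], leader: str, quorum_size: int) -> bool:
    """Step ⑤ sketch: commit on the fast path when enough replies (incl. the leader's) match."""
    if leader not in replies:
        return False
    matching = sum(1 for h in replies.values() if h == replies[leader])
    return matching >= quorum_size        # placeholder; see the paper for the exact quorum rule
```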
Slow path
24
① DOM-S tags the deadline to the request
② DOM-S multicasts the request to all replicas, and it is accepted by the Late Buffer
③ The leader modifies the request's deadline to make it eligible to enter the Early Buffer
④ DOM-R releases the request from the leader's Early Buffer
⑤ The leader executes the request
⑥ The leader replies to the client
⑦ The leader broadcasts its log indices to the followers
⑧ Followers modify their logs to keep them consistent with the leader
⑨ Followers fetch missing logs from their Late Buffers
⑩ Followers reply to the proxy for the quorum check
(A code sketch of the log-repair steps ⑦-⑨ follows the figure below.)
[Figure: slow path: requests from clients go through DOM-S (①) and are multicast to all replicas (②), landing in each replica's Late Buffer; the Leader promotes the request into its Early Buffer by modifying its deadline (③), DOM-R releases it (④), the Leader executes it on the State Machine (⑤) and replies to the client (⑥); the Leader broadcasts its log indices (⑦), Followers repair their Logs (⑧), fetching missing entries from their Late Buffers (⑨), and reply to the Proxy for the Quorum Check (⑩).]
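The core of the slow path is the log repair in steps ⑦-⑨. The sketch below (assumed names such as `sync_with_leader`; simplified entries) rebuilds a follower's log in the leader's order, promoting entries it holds only in its late buffer and reporting entries it would still need to fetch from the leader.

```python
from typing import Dict, List, Tuple

Entry = Tuple[float, str]   # (deadline, request id) -- simplified log entry

def sync_with_leader(leader_index: List[Entry],
                     follower_log: List[Entry],
                     late_buffer: Dict[str, Entry]) -> Tuple[List[Entry], List[str]]:
    """Steps ⑦-⑨ sketch: rebuild the follower's log in the leader's order.

    Returns the repaired log plus the ids of entries the follower does not hold locally
    (in the full protocol these would be fetched from the leader)."""
    have = {req for _, req in follower_log}
    repaired: List[Entry] = []
    missing: List[str] = []
    for ddl, req in leader_index:
        if req in have:
            repaired.append((ddl, req))              # step ⑧: keep the entry, in the leader's position
        elif late_buffer.pop(req, None) is not None:
            repaired.append((ddl, req))              # step ⑨: fetch the missing request from the late buffer
        else:
            missing.append(req)                      # not held locally at all
    return repaired, missing
```

On the slow path a follower never relies on its own arrival order; it adopts the leader's, which is why synchronized clocks act only as an accelerator and never as a correctness requirement.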
Related Work
25
Why consensus is still slow: the tax of coordination
26
[Figure: Client-1 and Client-2 send requests to Replica-1, Replica-2, and Replica-3, each holding a log list of the same requests.]
Consensus: replicate the same requests across multiple servers
Why consensus is still slow: the tax of coordination
27
[Figure: same as the previous slide.]
Consensus: replicate the same requests across multiple servers
Reducing the need for coordination
28
Mis-ordering happens because of message reordering in the network 🡪 reduce message reordering
Prior work uses highly-engineered networks:
Equalize the transmission path: Speculative Paxos, NetPaxos, etc.
In-network serialization: NOPaxos, NetChain, Hydra, etc.
29
Quantifying Reordering in the Public Cloud (Google Cloud)

Setup 1: 2 n1-standard-16 receivers + 2 n1-standard-4 senders
Per-Sender Multicast Rate | 20 | 40 | 60 | 80 | 100 |
Reordering Score | 13.9 | 28.2 | 28.5 | 30.0 | 33.0 |

Setup 2: 2 n1-standard-16 receivers + multiple n1-standard-4 senders (each multicasting at 10K requests/sec)
Number of Senders | 2 | 4 | 6 | 8 | 10 |
Reordering Score | 7.3 | 23.6 | 31.3 | 39.5 | 42.7 |

Serious message reordering in the public cloud
🡪 Much room to improve protocol performance
Nezha: High-performance and cloud-native consensus
🡪 Use synchronized clocks through a new multicast primitive: deadline-ordered multicast
30
[Figure: protocols positioned by Performance vs. Network Support: Paxos/Raft (no special network support, lower performance); SpecPaxos, NetPaxos, NOPaxos, NetChain, and Hydra (higher performance at the cost of network support); Nezha (high performance without special network support).]
31
How are deadlines computed?
The Nezha consensus protocol
32
[Figure: Nezha architecture: clients send requests to proxies, each running a DOM sender; proxies multicast to the Leader and Followers, each running a DOM receiver.]
🡪 Easy to fix inconsistency between the leader and followers
Scalability with Replicas
33
[Figure: throughput as the number of replicas grows, closed-loop and open-loop tests with 10 and 150 clients; annotation: bottleneck at clients.]
Proxy Evaluation
34
[Figure: proxy evaluation, 9 replicas (two panels).]
35
Comparison in WAN
Nezha's advantage: 1.51x-3.37x in latency and 2.55x-10.13x in throughput
The WAN setting further demonstrates Nezha's advantage 🡪 1 WAN RTT (optimal) latency
Application (Redis)
36
Application (CloudEx)
37
38
Failure Recovery (150 ms-300 ms)
39
Failure Recovery
[Figure: failure recovery at 20K, 100K, and 200K reqs/sec.]
Prior Works: Raft/Multi-Paxos and NOPaxos
40
[Figure: Raft/Multi-Paxos: clients send requests to the Leader, which replicates them to the Followers; the Leader is heavy.]
Prior Works
41
[Figure: Raft/Multi-Paxos alongside NOPaxos: NOPaxos clients run libnopaxos and perform the Quorum Check themselves, and requests pass through a Sequencer in a highly engineered network to the Leader and Followers (annotated "client-intrusive & heavy").]
Drawbacks of Prior Works
42
[Figure: same Performance vs. Network Support plot as slide 30: Paxos/Raft, SpecPaxos/NetPaxos, NOPaxos, NetChain, Hydra, and Nezha.]
DOM: Reduce Reordering
43
[Figure: Sender-1 sets a deadline ddl=10 and Sender-2 sets ddl=12; Receiver-1 and Receiver-2 hold the early-coming messages until their deadlines and release them at t=10 and t=12.]
DOM: Reduce Reordering in Best Effort
44
[Figure: Sender-1 sets ddl=10 and Sender-2 sets ddl=12, but one copy arrives at a receiver after its deadline ("Too late!"); DOM cannot rectify such reordering cases.]
45
[Figure: sender S multicasts to receivers R1, R2, …, Rn.]
DOM deadline estimation
When S multicasts to R1-Rn:
Larger percentile 🡪 Larger deadline
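The deck gives only the intuition ("larger percentile 🡪 larger deadline"). One plausible reading, sketched below as an assumption, is that the sender stamps each message with its send time plus a chosen percentile of recently measured sender-to-receiver one-way delays, so a higher percentile produces a later, more conservative deadline.

```python
from statistics import quantiles
from typing import Sequence

def estimate_deadline(send_time: float, owd_samples: Sequence[float],
                      percentile: int = 50) -> float:
    """Deadline = send time + chosen percentile of recent one-way-delay samples (assumed scheme).

    A larger percentile gives a larger (more conservative) deadline, matching the slide's
    'larger percentile -> larger deadline'."""
    # quantiles(..., n=100) returns the 1st..99th percentile cut points.
    cuts = quantiles(owd_samples, n=100, method="inclusive")
    return send_time + cuts[percentile - 1]

# Example with made-up delay samples (ms): a higher percentile pushes the deadline out.
samples = [0.8, 1.0, 1.1, 1.2, 1.5, 2.0, 3.5]
print(estimate_deadline(100.0, samples, percentile=50))  # median delay
print(estimate_deadline(100.0, samples, percentile=95))  # near-worst-case delay
```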