1 of 45


Nezha: Deployable and High-Performance Consensus Using Accurately Synchronized Clocks

Jinkun Geng*, Anirudh Sivaraman+, Balaji Prabhakar*, and Mendel Rosenblum*

*Stanford University, +New York University

VLDB 2023

Project open-sourced at https://github.com/Steamgjk/Nezha

2 of 45

Crash-Fault-Tolerant (CFT) Consensus

  • A long and rich research literature


  • Substantial impact on the practice of distributed systems

Storage/DBs: BigTable, TiKV, CockroachDB, MongoDB, RedisRaft, Delos, TiDB, Spanner

Coordination: Chubby, Zookeeper, etcd

Messaging: RabbitMQ, Apache Kafka

DNS Servers: CoreDNS

Container orchestration tools: Docker Swarm, Kubernetes …

[Timeline: 1980s Paxos, Viewstamped Replication; 2003 Fast Paxos; 2004 Generalized Paxos; 2008 Mencius; 2013 EPaxos; 2014 Raft; 2015 Speculative Paxos, NetPaxos; 2016 NOPaxos; 2017 APUS; 2018 NetChain; 2020 Mu]

3 of 45

Yet, consensus is often still a bottleneck

  • Databases: Cosmos DB trades weaker consistency levels for higher throughput.1
  • KV stores: A single Raft group cannot support the workload demanded by TiKV.2
  • Stream processing and message buses: Zookeeper constrains Apache Kafka/Pulsar from supporting more partitions.3,4
  • Container orchestration: etcd (Raft) can often become the bottleneck for Kubernetes when managing large-scale clusters.5

  1. Microsoft, 2023. Consistency Levels in Azure Cosmos DB
  2. Huang, et al. 2020. TiDB: A Raft-based HTAP Database, in VLDB’20
  3. Colin McCabe, 2020. Apache Kafka KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum
  4. StreamNative, 2022. Moving Toward a ZooKeeper-Less Apache Pulsar
  5. Jeffery, et al. 2021. Rearchitecting Kubernetes for the Edge, in EdgeSys’21

4 of 45

Why consensus is still slow: single-leader protocols

  • Goal of consensus: Establish a common ordering of client operations across multiple replicas by maintaining a replicated ordered log of these operations.
  • Simplest way to achieve consensus: serialize all operations with one replica

🡪 Single-leader protocols, e.g., Raft and Multi-Paxos.

[Figure: clients send all operations to the leader, which replicates them to the followers (Raft/Multi-Paxos); the leader becomes the bottleneck]

5 of 45

Fixing single-leader bottlenecks through optimistic protocols

  • Goal of consensus: Establish a common ordering of client operations across multiple machines by maintaining a replicated ordered log of these operations.
  • A better way: optimistic protocols:
    • client (or proxy) multicasts operations to all replicas to spread the load
    • If operations arrive in the same order at multiple replicas, ordering is achieved for free, and replicas respond directly to clients (Fast Path)
    • If not, use leader to decide order w/ extra message delays like Paxos (Slow Path)

[Figure: clients multicast operations directly to the leader and both followers (optimistic protocols, e.g., NOPaxos)]

  • Many examples of optimistic protocols
    • Fast Paxos, SpecPaxos, NOPaxos, EPaxos
    • Underlying assumption: fast path is indeed common.
    • But, what if---thanks to the network---client operations (i.e., messages) are received in different orders at different replicas?

6 of 45

Combating message reordering in the network

[Figure: sender S multicasts messages A and B to receivers R1 and R2 over multiple paths]

Multiple non-equal-delay paths between S and R1/R2 => messages arrive in different orders at different receivers

7 of 45

Combating message reordering in the network

Force S->R1 and S->R2 to have the same delay (e.g., SpecPaxos, NetPaxos, NOPaxos):

    • One way: control message routing
    • Substantial improvements in performance relative to Raft, Paxos, etc.

[Figure: with control over the network (e.g., routing), messages A and B from S traverse equal-delay paths and arrive in the same order at R1 and R2]

8 of 45

Our work: helping cloud tenants who can’t control routing


  • Answer: Synchronized Clocks
  • Counterintuitive idea: To equalize delay between sender and all its receivers, hold message at any receiver until message arrives at every other receiver
    • Small extra delay in the fast path -> increases fast path latency, but also increases fast path frequency
  • Deadline-Ordered Multicast (DOM) primitive and Nezha protocol
  • Why now?: Enabled by emerging availability of high-accuracy clock synchronization as a cloud service: Clockwork, Chrony, AWS Clock Bounds, Huygens (NSDI 2018)
  • Design principle: Clock synchronization for acceleration, not correctness

9 of 45

9

S

R4

R1

R2

R3

Rn

Deadline: 0009 | …

. . . . . . .

0000

0000

0000

0000

0000

0000

0001

0001

0001

0001

0001

0001

0002

0002

0002

0002

0002

0002

0003

0003

0003

0003

0003

0003

0004

0004

0004

0004

0004

0004

0005

0005

0005

0005

0005

0005

0006

0006

0006

0006

0006

0006

0007

0007

0007

0007

0007

0007

0008

0008

0008

0008

0008

0008

0009

0009

0009

0009

0009

0009

Deadline: 0009 | …

Deadline: 0009 | …

Deadline: 0009 | …

Deadline: 0009 | …

Deadline: 0009 | …

Senders and receivers synchronized to the same clock

Sender attaches deadline to message

Receiver processes message when deadline is reached on receiver clock

Sender multicasts message to all receivers

Messages arrive at unpredictable times

Deadline-Ordered Multicast (DOM)

and multiple messages in deadline order
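The hold-and-release rule above can be sketched in a few lines of C++. This is illustrative only, not the project's implementation; `Message`, `DomReceiver`, and `now_us` are made-up names, and a steady clock stands in for the synchronized clock. The idea: buffer arrivals in a priority queue keyed by deadline and release whatever has passed its deadline on the local clock.

```cpp
// Sketch of a DOM receiver: hold each message until its deadline passes on
// the local clock, then release messages in deadline order.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct Message {
    uint64_t deadline_us;   // deadline stamped by the sender
    std::string payload;
};

// Order the buffer so the smallest deadline is released first.
struct ByDeadline {
    bool operator()(const Message& a, const Message& b) const {
        return a.deadline_us > b.deadline_us;
    }
};

static uint64_t now_us() {   // local clock (stand-in for a synchronized clock)
    using namespace std::chrono;
    return duration_cast<microseconds>(
        steady_clock::now().time_since_epoch()).count();
}

class DomReceiver {
  public:
    void on_arrival(Message m) { early_buffer_.push(std::move(m)); }

    // Called periodically: release every held message whose deadline has passed.
    std::vector<Message> release_ready() {
        std::vector<Message> ready;
        while (!early_buffer_.empty() &&
               early_buffer_.top().deadline_us <= now_us()) {
            ready.push_back(early_buffer_.top());
            early_buffer_.pop();
        }
        return ready;   // already in deadline order
    }

  private:
    std::priority_queue<Message, std::vector<Message>, ByDeadline> early_buffer_;
};

int main() {
    DomReceiver r;
    r.on_arrival({now_us() + 500, "A"});  // held for ~500 us
    r.on_arrival({now_us() + 100, "B"});  // earlier deadline, released first
    size_t released = 0;
    while (released < 2) {                // busy-polls for brevity
        for (auto& m : r.release_ready()) {
            std::cout << "release " << m.payload << "\n";
            ++released;
        }
    }
}
```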

10 of 45

DOM deadline estimation

  • Use background probes to estimate the xth percentile one-way delay (OWD) between the sender and each receiver
  • Deadline = send timestamp + max of these OWD estimates across receivers
  • Key tradeoff: reordering rate vs. additional latency from sender to receiver
    • Larger percentile -> larger deadline: more latency, but less reordering
    • We use the 50th percentile and Huygens (NSDI 2018) to sync clocks with a median accuracy of ~100 ns.
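A minimal sketch of the deadline rule on this slide, assuming the per-receiver percentile estimates have already been computed; the name `make_deadline` and the sample values are illustrative.

```cpp
// Deadline rule: send timestamp + the largest xth-percentile OWD estimate
// among the receivers (estimates come from background probes).
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

uint64_t make_deadline(uint64_t send_ts_us,
                       const std::vector<uint64_t>& owd_estimate_us_per_receiver) {
    uint64_t max_owd = *std::max_element(owd_estimate_us_per_receiver.begin(),
                                         owd_estimate_us_per_receiver.end());
    return send_ts_us + max_owd;
}

int main() {
    // Hypothetical 50th-percentile OWD estimates (us) for three receivers.
    std::vector<uint64_t> owd = {120, 95, 210};
    std::cout << make_deadline(/*send_ts_us=*/1'000'000, owd) << "\n";  // 1000210
}
```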

11 of 45

Defining a reordering score to evaluate DOM effectiveness

[Figure: example sequences of messages 1–4 received at R1 and R2 over time, contrasting matching (in-order) arrival orders with mismatched (reordered) ones]

12 of 45

DOM reduces message reordering in the cloud (Google Cloud)

2 receivers + 10 senders multicasting at 10K requests/sec each

[Figure: reordering score as the deadline percentile increases; larger percentile 🡪 larger deadline]

Takeaway-1: Message reordering indeed occurs frequently in the public cloud

🡪 Much room for performance improvement of consensus protocols

Takeaway-2: DOM is a best-effort primitive, i.e., it reduces but does not eliminate reordering

🡪 Our protocol should also handle inconsistency in the slow path

13 of 45

Nezha fast path: All messages arrive before their deadlines

13

ddl=10

ddl=12

t=

Leader

Follower-1

Client-1

Client-2

Follower-2

0

9

10

Committed

Committed

12

14 of 45

Nezha slow path

[Figure: Client-2’s request (ddl=12) arrives at the followers after its deadline (“too late”), so the replicas initially disagree; the leader fixes the inconsistency on the slow path and the requests eventually commit (t=0 … 13)]
Benefit of fast path vs. slow path structure: Use synchronized clocks for acceleration, not correctness

15 of 45

Evaluation

  • 3 baselines: Multi-Paxos, Fast Paxos, Raft (ATC 14)
  • Workload: each client maintains 1 outstanding request, vary number of clients
    • 2 versions of Nezha: Nezha-Proxy and Nezha-Client
  • All experiments are conducted in Google Cloud (us-central1-a)
    • 3 Replicas: n1-standard-16 VMs
    • Clients: n1-standard-4 VMs
    • 5 Proxies: n1-standard-32 VMs
  • https://arxiv.org/pdf/2206.03285.pdf has more extensive evaluations with more baselines: NOPaxos reimplemented in software (OSDI 16), TOQ (NSDI 21), Domino (CoNext 20)


16 of 45


Comparison to memory-based protocols (log is in memory)

  • Nezha-Proxy vs. Nezha-Client: A proxy increases latency, but saves client CPU
  • Nezha improves performance because of (1) reordering reduction + (2) reduced load on the leader

17 of 45

Comparison to disk-based protocols

  • We equip Nezha with disk persistence 🡪 achieves the same fault-tolerance guarantee as Raft
  • Raft-1: Diego Ongaro’s implementation; Raft-2: our implementation based on Multi-Paxos
  • Again, Nezha improves performance, but absolute numbers are worse because of disk

18 of 45

Conclusion


  • Simple idea to accelerate optimistic consensus protocols:
    • Synchronize client and replica clocks.
    • Attach a deadline based on one-way delay to each operation and multicast the operation to the replicas.
    • Delay operation processing at replicas until deadline.
    • Order operations by deadline.
  • Increases fast path frequency at the cost of some increase in fast path latency
  • Future work: (1) Integrate with industrial systems (e.g., Kubernetes, cloud stock exchanges (CloudEx, HotOS’21)) (2) Integrate with concurrency control to support multi-sharded systems
  • Code + FAQ: https://github.com/Steamgjk/Nezha

19 of 45


Comparison for diskless protocols (final copy in memory)

  • Nezha-Proxy vs Nezha-Client: A proxy increases latency, but saves client CPUs
  • Nezha’s throughput speedup: 1.9-20.9x because of (1) reordering reduction + (2) reducing load on leader

* NOPaxos with a software sequencer, because programmable switches are not available in the public cloud.

20 of 45

  • We equip Nezha with disk persistence 🡪 achieves the same fault-tolerance guarantee as Raft
  • Raft-1: Diego Ongaro’s implementation; Raft-2: our implementation based on Multi-Paxos


Nezha vs. Raft (data persisted to disk)

[Figure: latency (ms) vs. throughput (x1K reqs/sec) for the closed-loop and open-loop tests; labeled throughputs include 4.5K, 29.7K, 30.7K, 78.9K, and 87.1K reqs/sec]

  • Nezha’s throughput speedup: 2.6-17.5x in the closed-loop test; 2.9x in the open-loop test.
  • Raft suffers from a heavy leader bottleneck


21 of 45


Comparison for diskless protocols (final copy in memory)

[Figure: latency (μs) vs. throughput (x1K reqs/sec) for the closed-loop and open-loop tests; labeled throughputs include 9.7K, 12K, 22K, 25K, 69K, 80K, 109K, 191K, and 203K reqs/sec]

  • Nezha-Proxy vs Nezha-Non-Proxy: Co-locating the proxy and client can save more latency, but makes the client do more work 🡪 undesirable when client VMs are CPU-limited
  • Nezha’s throughput speedup: 2.5-20.9x in closed-loop test; 2.5-9.0x in open-loop test.

* NOPaxos reimplemented with a software sequencer, because programmable switches are not available in the public cloud.

22 of 45

Yet, consensus is often still a bottleneck (Pulsar as an example)

[Figure: Pulsar architecture. Producers and consumers talk to brokers; a dispatcher and BK client hand data to BookKeeper bookies, which store the data (topic-based messages); a local ZooKeeper1 persists the metadata (e.g., topic partition info and message offsets on bookies), and a global replicator talks to a global ZooKeeper. The ZooKeeper components are the bottleneck.]

1. The latest version of Pulsar also supports other types of consensus modules, including etcd (i.e., a Raft implementation)

23 of 45

Fast path

  1. DOM-S tags the deadline onto the request
  2. DOM-S multicasts the request to all replicas, where it is accepted by the Early Buffer
  3. DOM-R releases the request according to its deadline and appends it to the log
  4. The leader executes the request
  5. Replicas compute a hash and reply to the client for the quorum check (sketched below)

[Figure: fast path. The proxy (DOM-S) multicasts requests 1–6 to the leader and both followers; each replica’s DOM-R releases them from the Early Buffer into its log in deadline order; the leader’s state machine executes them; replies flow back through the quorum check to the clients. The Late Buffer stays unused on the fast path.]
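To make the quorum check concrete, here is a simplified sketch of the fast-path check as we understand it from the paper: a request commits on the fast path once f + ⌈f/2⌉ + 1 of the 2f+1 replicas, the leader among them, return fast-replies whose log hashes match the leader’s. Names like `FastReply` and `FastQuorumCheck` are illustrative, and real replies carry more fields (e.g., the view) that this sketch omits.

```cpp
// Simplified fast-path quorum check (proxy- or client-side): commit when
// f + ceil(f/2) + 1 replicas, including the leader, report matching log hashes.
#include <cstdint>
#include <iostream>
#include <map>
#include <optional>
#include <string>

struct FastReply {
    uint32_t replica_id;
    bool     is_leader;
    std::string log_hash;   // hash of the replica's log prefix for this request
};

class FastQuorumCheck {
  public:
    explicit FastQuorumCheck(uint32_t f) : f_(f) {}

    // Feed replies as they arrive; returns true once the fast quorum is met.
    bool add(const FastReply& r) {
        replies_[r.replica_id] = r.log_hash;
        if (r.is_leader) leader_hash_ = r.log_hash;
        if (!leader_hash_) return false;          // the leader's reply is mandatory
        uint32_t matching = 0;
        for (const auto& kv : replies_)
            if (kv.second == *leader_hash_) ++matching;
        return matching >= f_ + (f_ + 1) / 2 + 1; // f + ceil(f/2) + 1
    }

  private:
    uint32_t f_;
    std::map<uint32_t, std::string> replies_;
    std::optional<std::string> leader_hash_;
};

int main() {
    FastQuorumCheck q(/*f=*/1);                    // 3 replicas, fast quorum = 3
    q.add({0, /*is_leader=*/true,  "h1"});
    q.add({1, /*is_leader=*/false, "h1"});
    std::cout << q.add({2, false, "h1"}) << "\n";  // prints 1: fast commit
}
```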

24 of 45

Slow path

  1. DOM-S tags the deadline onto the request
  2. DOM-S multicasts the request to all replicas, where it is accepted by the Late Buffer
  3. The leader modifies the request’s deadline to make it eligible to enter the Early Buffer
  4. DOM-R releases the request from the leader’s Early Buffer
  5. The leader executes the request
  6. The leader replies to the client
  7. The leader broadcasts its log indices to the followers
  8. Followers modify their logs to stay consistent with the leader
  9. Followers fetch missing logs from their Late Buffers (see the sketch below)
  10. Followers reply to the proxy for the quorum check

[Figure: slow path. Requests land in the replicas’ Late Buffers; the leader re-stamps deadlines, appends to its log, executes, and broadcasts its log indices; the followers repair their logs from their Late Buffers and reply to the proxy for the quorum check before replies go back to the clients.]
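A sketch of the follower-side repair in steps 8-9, under simplifying assumptions (the leader ships its full ordered log rather than just indices; `Follower`, `Entry`, and the values are illustrative, not the project's API): the follower rebuilds its log in the leader’s order, pulling entries it is missing from its Late Buffer.

```cpp
// Slow-path log repair sketch: a follower rebuilds its log in the order
// dictated by the leader, fetching entries it is missing from its late buffer.
#include <cstdint>
#include <iostream>
#include <map>
#include <utility>
#include <vector>

struct Entry { uint64_t deadline_us; uint64_t request_id; };

class Follower {
  public:
    void hold_late(Entry e) { late_buffer_[e.request_id] = e; }   // missed its deadline
    void append(Entry e)    { log_.push_back(e); }                // released by DOM-R

    // Apply the leader's ordering. Returns false if some entry is neither in the
    // log nor in the late buffer (the full protocol would then fetch it from the leader).
    bool sync_with_leader(const std::vector<Entry>& leader_log) {
        std::map<uint64_t, Entry> known;
        for (const auto& e : log_) known[e.request_id] = e;
        for (const auto& kv : late_buffer_) known[kv.first] = kv.second;
        std::vector<Entry> rebuilt;
        for (const auto& e : leader_log) {
            auto it = known.find(e.request_id);
            if (it == known.end()) return false;
            Entry fixed = it->second;
            fixed.deadline_us = e.deadline_us;   // adopt the leader's (re-stamped) deadline
            rebuilt.push_back(fixed);
        }
        log_ = std::move(rebuilt);
        return true;
    }

    const std::vector<Entry>& log() const { return log_; }

  private:
    std::vector<Entry> log_;
    std::map<uint64_t, Entry> late_buffer_;
};

int main() {
    Follower f;
    f.append({10, 1});      // request 1 was released on time
    f.hold_late({12, 2});   // request 2 arrived after its deadline
    bool ok = f.sync_with_leader({{10, 1}, {13, 2}});   // leader re-stamped request 2
    std::cout << ok << " log size " << f.log().size() << "\n";   // prints: 1 log size 2
}
```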

25 of 45

Related Work

  • Spanner and CockroachDB: Rely on clocks for correctness by assuming worst-case error bounds; hard to guarantee such worst-case bounds with software clock synchronization

  • Granola and CLOCC: Timestamp based on transaction arrival, rather than a deadline.


26 of 45

Why consensus is still slow: the tax of coordination

  • Mis-ordering of events: Consensus aims to consistently order events across multiple servers

[Figure: Client-1 and Client-2 send requests to Replica-1, Replica-2, and Replica-3, each maintaining a log list]

Consensus: replicate the same requests across multiple servers


28 of 45

Reducing the need for coordination

  • Mis-ordering of events: Consensus aims to consistently order events across servers

Mis-ordering happens because of message reordering in the network 🡪 reduce message reordering

Prior work uses highly-engineered networks:
    • Equalize the transmission path: Speculative Paxos, NetPaxos, etc.
    • In-network serialization: NOPaxos, NetChain, Hydra, etc.

  • What about the public cloud?
    1. The tenant cannot engineer the network
    2. The reordering problem is quite bad

29 of 45

Quantifying reordering in the public cloud (Google Cloud)

2 n1-standard-16 receivers + 2 n1-standard-4 senders:

Per-Sender Multicast Rate:   20     40     60     80     100
Reordering Score:            13.9   28.2   28.5   30.0   33.0

2 n1-standard-16 receivers + multiple n1-standard-4 senders (each multicasting at 10K requests/sec):

Number of Senders:           2      4      6      8      10
Reordering Score:            7.3    23.6   31.3   39.5   42.7

Serious message reordering in the public cloud 🡪 much room to improve protocol performance

30 of 45

Nezha: High-performance and cloud native consensus

  • Reduce message reordering in the public cloud to accelerate protocols 

🡪 Use synchronized clocks through a new multicast primitive: deadline-ordered multicast

  • Why can synchronized clocks help?
    • They “equalize” the delay between senders and receivers by adding end-host delays
    • Wait a little to avoid waiting even more to coordinate.

[Figure: protocols plotted by performance vs. required network support; Paxos/Raft need no network support but have low performance, SpecPaxos/NetPaxos and NOPaxos/NetChain/Hydra gain performance at the cost of network support, while Nezha targets high performance without network support]

31 of 45

How are deadlines computed?

  • Synchronize clocks on the sender and all receivers.
  • Measure sender-to-receiver one-way delays at all receivers.
  • Piggyback the measurements back to the sender.
  • The sender computes the xth percentile one-way delay for each receiver over a sliding window of measurement samples.
  • It then picks the maximum of these one-way delays across all receivers to set the deadline.
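A compact sketch of the sliding-window estimation described above; the class name and window size are illustrative, not the project's API.

```cpp
// Sliding-window OWD estimator: keep the last W one-way-delay samples per
// receiver and combine the xth-percentile estimates into a deadline.
#include <algorithm>
#include <cstdint>
#include <deque>
#include <iostream>
#include <map>
#include <vector>

class OwdEstimator {
  public:
    explicit OwdEstimator(size_t window) : window_(window) {}

    // Record one probe measurement (piggybacked back from a receiver).
    void add_sample(uint32_t receiver_id, uint64_t owd_us) {
        auto& w = samples_[receiver_id];
        w.push_back(owd_us);
        if (w.size() > window_) w.pop_front();
    }

    // xth percentile of the current window for one receiver.
    uint64_t percentile(uint32_t receiver_id, double x) const {
        std::vector<uint64_t> v(samples_.at(receiver_id).begin(),
                                samples_.at(receiver_id).end());
        std::sort(v.begin(), v.end());
        return v[static_cast<size_t>(x / 100.0 * (v.size() - 1))];
    }

    // Deadline = send time + max percentile OWD across all receivers.
    uint64_t deadline(uint64_t send_ts_us, double x) const {
        uint64_t max_owd = 0;
        for (const auto& kv : samples_)
            max_owd = std::max(max_owd, percentile(kv.first, x));
        return send_ts_us + max_owd;
    }

  private:
    size_t window_;
    std::map<uint32_t, std::deque<uint64_t>> samples_;
};

int main() {
    OwdEstimator est(/*window=*/100);
    est.add_sample(1, 120); est.add_sample(1, 130); est.add_sample(1, 500);
    est.add_sample(2, 90);  est.add_sample(2, 95);  est.add_sample(2, 800);
    std::cout << est.deadline(/*send_ts_us=*/0, /*x=*/50.0) << "\n";  // prints 130
}
```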

32 of 45

The Nezha consensus protocol

  • Nezha employs stateless proxies to do (1) DOM multicast and (2) the quorum check
  • Proxies (1) remove a heavy burden from the leader/client, (2) enjoy much better scalability, and (3) make Nezha a direct replacement for Raft/Multi-Paxos
  • Install DOM:
    • Proxy 🡪 DOM sender
    • Replica 🡪 DOM receiver
  • Only the leader executes requests; followers only maintain a log list 🡪 easy to fix inconsistency between leader and followers

[Figure: clients send requests to proxies (DOM senders), which multicast them to the leader and followers (DOM receivers)]
33 of 45

Scalability with Replicas

[Figure: throughput vs. number of replicas for the closed-loop test and the open-loop test; labels: 150 clients, 10 clients, bottleneck at clients]

34 of 45

Proxy Evaluation

    • The proxies enable a single client to maintain high throughput
    • The proxies reduce CPU costs for clients
    • The proxies also provide a latency benefit under high throughput load

[Figures: measured with 9 replicas]

35 of 45


Comparison in WAN

Latency: 1.51x-3.37x; Throughput: 2.55x-10.13x

WAN setting further demonstrates Nezha’s advantage 🡪 1 WAN RTT (optimal) latency

36 of 45

Application (Redis)

  • 20 closed-loop clients submit requests under a 10 ms SLO
  • YCSB-A workload (HMSET/HGETALL on 1000 keys)
  • Nezha’s performance is within 6% of the unreplicated system

37 of 45

Application (CloudEx)

  • 48 open-loop traders + 16 gateways + 1 matching engine
  • Among the four protocols, only Nezha saturates the matching engine’s processing capacity (~43K orders/sec)
  • Nezha prolongs end-to-end latency by 19.7% but achieves very similar order-processing latency

38 of 45

  • We kill the leader and measure
    • How long does it take for the remaining replicas to complete a view change with the new leader elected?

Failure Recovery

[Figure: view-change duration after killing the leader (150 ms-300 ms)]

39 of 45

  • We kill the leader and measure
    • How long does it take for the remaining replicas to complete a view change with the new leader elected?
    • How long does it take to recover the throughput to the same level before crash?

Failure Recovery

[Figure: throughput recovery after killing the leader, at offered loads of 20K, 100K, and 200K reqs/sec]

40 of 45

Prior Works: Raft/Multi-Paxos and NOPaxos

[Figure: Raft/Multi-Paxos — clients send requests to the leader, which replicates to the followers; the leader’s role is heavy]

41 of 45

Prior Works

[Figure: Raft/Multi-Paxos (heavy leader) vs. NOPaxos (clients run libnopaxos and a quorum check, and requests go through an in-network sequencer in a highly engineered network; the approach is client-intrusive and heavy)]

42 of 45

Drawbacks of Prior Works

  • Our goal:
    • Both high performance and easy deployment 🡪 use DOM to achieve this goal

[Figure: protocols plotted by performance vs. required network support (Paxos/Raft, SpecPaxos/NetPaxos, NOPaxos, NetChain, Hydra); Nezha targets high performance without special network support]

43 of 45

DOM: Reduce Reordering

[Figure: Sender-1 sets deadline ddl=10 and Sender-2 sets ddl=12; each receiver holds an early-arriving message until its deadline, so Receiver-1 and Receiver-2 both release the messages at t=10 and t=12 in the same order]

44 of 45

DOM: Reduce Reordering in a Best-Effort Manner

[Figure: Sender-1 sets deadline ddl=10 and Sender-2 sets ddl=12, but one message arrives at a receiver after its deadline (“too late”); DOM cannot rectify such reordering cases]

45 of 45

DOM deadline estimation

  • Use probe packets to estimate the one-way delay (OWD) between the sender and the receivers.
  • The sender collects many OWD samples and uses a sliding-window approach to calculate the estimated OWD.
  • When S multicasts to R1-Rn: larger percentile 🡪 larger deadline

[Figure: sender S and receivers R1…Rn exchanging probes]