1 of 45


Nezha: Deployable and High-Performance Consensus Using Accurately Synchronized Clocks

Jinkun Geng*, Anirudh Sivaraman+, Balaji Prabhakar*, and Mendel Rosenblum*

*Stanford University, +New York University

VLDB 2023

Project open-sourced at https://github.com/Steamgjk/Nezha

2 of 45

Crash-Fault-Tolerant (CFT) Consensus

  • A long and rich research literature


  • Substantial impact on the practice of distributed systems

Storage/DBs: BigTable, TiKV, CockroachDB, MongoDB, RedisRaft, Delos, TiDB, Spanner

Coordination: Chubby, Zookeeper, etcd

Messaging: RabbitMQ, Apache Kafka

DNS Servers: CoreDNS

Container orchestration tools: Docker Swarm, Kubernetes …

[Timeline: 1980s Paxos, Viewstamped Replication; 2003 Fast Paxos; 2004 Generalized Paxos; 2008 Mencius; 2013 EPaxos; 2014 Raft; 2015 Speculative Paxos, NetPaxos; 2016 NOPaxos; 2017 APUS; 2018 NetChain; 2020 Mu]

3 of 45

Yet, consensus is often still a bottleneck

  • Databases: Cosmos DB trades weaker consistency levels for higher throughput.1
  • KV stores: A single Raft group cannot support the workload demanded by TiKV.2
  • Stream processing and message buses: Zookeeper constrains Apache Kafka/Pulsar from supporting more partitions.3,4
  • Container orchestration: etcd (Raft) can often become the bottleneck for Kubernetes when managing large-scale clusters.5

  1. Microsoft, 2023. Consistency Levels in Azure Cosmos DB
  2. Huang, et al. 2020. TiDB: A Raft-based HTAP Database, in VLDB’20
  3. Colin McCabe, 2020. Apache Kafka KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum
  4. StreamNative, 2022. Moving Toward a ZooKeeper-Less Apache Pulsar
  5. Jeffery, et al. 2021. Rearchitecting Kubernetes for the Edge, in EdgeSys’21

4 of 45

Why consensus is still slow: single-leader protocols

  • Goal of consensus: Establish a common ordering of client operations across multiple replicas by maintaining a replicated ordered log of these operations.
  • Simplest way to achieve consensus: serialize all operations with one replica

🡪 Single-leader protocols, e.g., Raft and Multi-Paxos.

[Figure: clients send all operations to the leader, which replicates them to the followers (Raft/Multi-Paxos); the leader becomes the bottleneck]

5 of 45

Fixing single-leader bottlenecks through optimistic protocols

  • Goal of consensus: Establish a common ordering of client operations across multiple machines by maintaining a replicated ordered log of these operations.
  • A better way: optimistic protocols:
    • client (or proxy) multicasts operations to all replicas to spread the load
    • If operations arrive in the same order at multiple replicas, ordering is achieved for free, and replicas respond directly to clients (Fast Path)
    • If not, use leader to decide order w/ extra message delays like Paxos (Slow Path)

[Figure: clients multicast operations directly to the leader and both followers (optimistic protocols, e.g., NOPaxos)]

  • Many examples of optimistic protocols
    • Fast Paxos, SpecPaxos, NOPaxos, EPaxos
    • Underlying assumption: fast path is indeed common.
    • But, what if---thanks to the network---client operations (i.e., messages) are received in different orders at different replicas?

6 of 45

Combating message reordering in the network

[Figure: sender S multicasts messages A and B to receivers R1 and R2 over multiple paths]

Multiple non-equal-delay paths between S and R1/R2 => messages arrive in different orders at different receivers

7 of 45

Combating message reordering in the network

Force S->R1 and S->R2 to have the same delay (e.g., SpecPaxos, NetPaxos, NOPaxos):

    • One way: control message routing
    • Substantial improvements in performance relative to Raft, Paxos, etc.

[Figure: with control over the network (e.g., routing), messages A and B from S traverse equal-delay paths and arrive in the same order at R1 and R2]

8 of 45

Our work: helping cloud tenants who can’t control routing


  • Answer: Synchronized Clocks
  • Counterintuitive idea: To equalize delay between sender and all its receivers, hold message at any receiver until message arrives at every other receiver
    • Small extra delay in the fast path -> increases fast path latency, but also increases fast path frequency
  • Deadline-Ordered Multicast (DOM) primitive and Nezha protocol
  • Why now?: Enabled by emerging availability of high-accuracy clock synchronization as a cloud service: Clockwork, Chrony, AWS Clock Bounds, Huygens (NSDI 2018)
  • Design principle: Clock synchronization for acceleration, not correctness

9 of 45

9

S

R4

R1

R2

R3

Rn

Deadline: 0009 | …

. . . . . . .

0000

0000

0000

0000

0000

0000

0001

0001

0001

0001

0001

0001

0002

0002

0002

0002

0002

0002

0003

0003

0003

0003

0003

0003

0004

0004

0004

0004

0004

0004

0005

0005

0005

0005

0005

0005

0006

0006

0006

0006

0006

0006

0007

0007

0007

0007

0007

0007

0008

0008

0008

0008

0008

0008

0009

0009

0009

0009

0009

0009

Deadline: 0009 | …

Deadline: 0009 | …

Deadline: 0009 | …

Deadline: 0009 | …

Deadline: 0009 | …

Senders and receivers synchronized to the same clock

Sender attaches deadline to message

Receiver processes message when deadline is reached on receiver clock

Sender multicasts message to all receivers

Messages arrive at unpredictable times

Deadline-Ordered Multicast (DOM)

and multiple messages in deadline order
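The hold-and-release rule above can be sketched in a few lines of C++. This is illustrative only, not the project's implementation; `Message`, `DomReceiver`, and `now_us` are made-up names, and a steady clock stands in for the synchronized clock. The idea: buffer arrivals in a priority queue keyed by deadline and release whatever has passed its deadline on the local clock.

```cpp
// Sketch of a DOM receiver: hold each message until its deadline passes on
// the local clock, then release messages in deadline order.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct Message {
    uint64_t deadline_us;   // deadline stamped by the sender
    std::string payload;
};

// Order the buffer so the smallest deadline is released first.
struct ByDeadline {
    bool operator()(const Message& a, const Message& b) const {
        return a.deadline_us > b.deadline_us;
    }
};

static uint64_t now_us() {   // local clock (stand-in for a synchronized clock)
    using namespace std::chrono;
    return duration_cast<microseconds>(
        steady_clock::now().time_since_epoch()).count();
}

class DomReceiver {
  public:
    void on_arrival(Message m) { early_buffer_.push(std::move(m)); }

    // Called periodically: release every held message whose deadline has passed.
    std::vector<Message> release_ready() {
        std::vector<Message> ready;
        while (!early_buffer_.empty() &&
               early_buffer_.top().deadline_us <= now_us()) {
            ready.push_back(early_buffer_.top());
            early_buffer_.pop();
        }
        return ready;   // already in deadline order
    }

  private:
    std::priority_queue<Message, std::vector<Message>, ByDeadline> early_buffer_;
};

int main() {
    DomReceiver r;
    r.on_arrival({now_us() + 500, "A"});  // held for ~500 us
    r.on_arrival({now_us() + 100, "B"});  // earlier deadline, released first
    size_t released = 0;
    while (released < 2) {                // busy-polls for brevity
        for (auto& m : r.release_ready()) {
            std::cout << "release " << m.payload << "\n";
            ++released;
        }
    }
}
```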

10 of 45

DOM deadline estimation

  • Use background probes to estimate the xth percentile one-way delay (OWD) between the sender and each receiver
  • Deadline = send timestamp + max of these OWD estimates across receivers
  • Key tradeoff: reordering rate vs. additional latency from sender to receiver
    • Larger percentile -> larger deadline: more latency, but less reordering
    • We use the 50th percentile and Huygens (NSDI 2018) to sync clocks with a median accuracy of ~100 ns.
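A minimal sketch of the deadline rule on this slide, assuming the per-receiver percentile estimates have already been computed; the name `make_deadline` and the sample values are illustrative.

```cpp
// Deadline rule: send timestamp + the largest xth-percentile OWD estimate
// among the receivers (estimates come from background probes).
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

uint64_t make_deadline(uint64_t send_ts_us,
                       const std::vector<uint64_t>& owd_estimate_us_per_receiver) {
    uint64_t max_owd = *std::max_element(owd_estimate_us_per_receiver.begin(),
                                         owd_estimate_us_per_receiver.end());
    return send_ts_us + max_owd;
}

int main() {
    // Hypothetical 50th-percentile OWD estimates (us) for three receivers.
    std::vector<uint64_t> owd = {120, 95, 210};
    std::cout << make_deadline(/*send_ts_us=*/1'000'000, owd) << "\n";  // 1000210
}
```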

11 of 45

Defining a reordering score to evaluate DOM effectiveness

[Figure: example sequences of messages 1–4 received at R1 and R2 over time, contrasting matching (in-order) arrival orders with mismatched (reordered) ones]

12 of 45

DOM reduces message reordering in the cloud (Google Cloud)

2 receivers + 10 senders multicasting at 10K requests/sec each

[Figure: reordering score as the deadline percentile increases; larger percentile 🡪 larger deadline]

Takeaway-1: Message reordering indeed occurs frequently in the public cloud

🡪 Much room for performance improvement of consensus protocols

Takeaway-2: DOM is a best-effort primitive, i.e., it reduces but does not eliminate reordering

🡪 Our protocol should also handle inconsistency in the slow path

13 of 45

Nezha fast path: All messages arrive before their deadlines

13

ddl=10

ddl=12

t=

Leader

Follower-1

Client-1

Client-2

Follower-2

0

9

10

Committed

Committed

12

14 of 45

Nezha slow path

[Figure: Client-2’s request (ddl=12) arrives at the followers after its deadline (“too late”), so the replicas initially disagree; the leader fixes the inconsistency on the slow path and the requests eventually commit (t=0 … 13)]
Benefit of fast path vs. slow path structure: Use synchronized clocks for acceleration, not correctness

15 of 45

Evaluation

  • 3 baselines: Multi-Paxos, Fast Paxos, Raft (ATC 14)
  • Workload: each client maintains 1 outstanding request, vary number of clients
    • 2 versions of Nezha: Nezha-Proxy and Nezha-Client
  • All experiments are conducted in Google Cloud (us-central1-a)
    • 3 Replicas: n1-standard-16 VMs
    • Clients: n1-standard-4 VMs
    • 5 Proxies: n1-standard-32 VMs
  • https://arxiv.org/pdf/2206.03285.pdf has more extensive evaluations with more baselines: NOPaxos reimplemented in software (OSDI 16), TOQ (NSDI 21), Domino (CoNext 20)


16 of 45


Comparison to memory-based protocols (log is in memory)

  • Nezha-Proxy vs. Nezha-Client: A proxy increases latency, but saves client CPU
  • Nezha improves performance because of (1) reordering reduction + (2) reduced load on the leader

17 of 45

Comparison to disk-based protocols

  • We equip Nezha with disk persistence 🡪 achieves the same fault-tolerance guarantee as Raft
  • Raft-1: Diego Ongaro’s implementation; Raft-2: our implementation based on Multi-Paxos
  • Again, Nezha improves performance, but absolute numbers are worse because of disk

18 of 45

Conclusion


  • Simple idea to accelerate optimistic consensus protocols:
    • Synchronize client and replica clocks.
    • Attach a deadline based on one-way delay to each operation and multicast the operation to the replicas.
    • Delay operation processing at replicas until deadline.
    • Order operations by deadline.
  • Increases fast path frequency at the cost of some increase in fast path latency
  • Future work: (1) Integrate with industrial systems (e.g., Kubernetes, cloud stock exchanges (CloudEx, HotOS’21)) (2) Integrate with concurrency control to support multi-sharded systems
  • Code + FAQ: https://github.com/Steamgjk/Nezha

19 of 45


Comparison for diskless protocols (final copy in memory)

  • Nezha-Proxy vs Nezha-Client: A proxy increases latency, but saves client CPUs
  • Nezha’s throughput speedup: 1.9-20.9x because of (1) reordering reduction + (2) reducing load on leader

* NOPaxos with a software sequencer, because programmable switches are not available in the public cloud.

20 of 45

  • We equip Nezha with disk persistence 🡪 achieves the same fault-tolerance guarantee as Raft
  • Raft-1: Diego Ongaro’s implementation; Raft-2: our implementation based on Multi-Paxos


Nezha vs. Raft (data persisted to disk)

[Figure: latency (ms) vs. throughput (x1K reqs/sec) for the closed-loop and open-loop tests; labeled throughputs include 4.5K, 29.7K, 30.7K, 78.9K, and 87.1K reqs/sec]

  • Nezha’s throughput speedup: 2.6-17.5x in the closed-loop test; 2.9x in the open-loop test.
  • Raft suffers from a heavy leader bottleneck


21 of 45


Comparison for diskless protocols (final copy in memory)

[Figure: latency (μs) vs. throughput (x1K reqs/sec) for the closed-loop and open-loop tests; labeled throughputs include 9.7K, 12K, 22K, 25K, 69K, 80K, 109K, 191K, and 203K reqs/sec]

  • Nezha-Proxy vs Nezha-Non-Proxy: Co-locating the proxy and client can save more latency, but makes the client do more work 🡪 undesirable when client VMs are CPU-limited
  • Nezha’s throughput speedup: 2.5-20.9x in closed-loop test; 2.5-9.0x in open-loop test.

* NOPaxos reimplemented with a software sequencer, because programmable switches are not available in the public cloud.

22 of 45

Yet, consensus is often still a bottleneck (Pulsar as an example)

[Figure: Pulsar architecture. Producers and consumers talk to brokers; a dispatcher and BK client hand data to BookKeeper bookies, which store the data (topic-based messages); a local ZooKeeper1 persists the metadata (e.g., topic partition info and message offsets on bookies), and a global replicator talks to a global ZooKeeper. The ZooKeeper components are the bottleneck.]

1. The latest version of Pulsar also supports other types of consensus modules, including etcd (i.e., a Raft implementation)

23 of 45

Fast path

  1. DOM-S tags the deadline onto the request
  2. DOM-S multicasts the request to all replicas, where it is accepted by the Early Buffer
  3. DOM-R releases the request according to its deadline and appends it to the log
  4. The leader executes the request
  5. Replicas compute a hash and reply to the client for the quorum check (sketched below)

[Figure: fast path. The proxy (DOM-S) multicasts requests 1–6 to the leader and both followers; each replica’s DOM-R releases them from the Early Buffer into its log in deadline order; the leader’s state machine executes them; replies flow back through the quorum check to the clients. The Late Buffer stays unused on the fast path.]
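To make the quorum check concrete, here is a simplified sketch of the fast-path check as we understand it from the paper: a request commits on the fast path once f + ⌈f/2⌉ + 1 of the 2f+1 replicas, the leader among them, return fast-replies whose log hashes match the leader’s. Names like `FastReply` and `FastQuorumCheck` are illustrative, and real replies carry more fields (e.g., the view) that this sketch omits.

```cpp
// Simplified fast-path quorum check (proxy- or client-side): commit when
// f + ceil(f/2) + 1 replicas, including the leader, report matching log hashes.
#include <cstdint>
#include <iostream>
#include <map>
#include <optional>
#include <string>

struct FastReply {
    uint32_t replica_id;
    bool     is_leader;
    std::string log_hash;   // hash of the replica's log prefix for this request
};

class FastQuorumCheck {
  public:
    explicit FastQuorumCheck(uint32_t f) : f_(f) {}

    // Feed replies as they arrive; returns true once the fast quorum is met.
    bool add(const FastReply& r) {
        replies_[r.replica_id] = r.log_hash;
        if (r.is_leader) leader_hash_ = r.log_hash;
        if (!leader_hash_) return false;          // the leader's reply is mandatory
        uint32_t matching = 0;
        for (const auto& kv : replies_)
            if (kv.second == *leader_hash_) ++matching;
        return matching >= f_ + (f_ + 1) / 2 + 1; // f + ceil(f/2) + 1
    }

  private:
    uint32_t f_;
    std::map<uint32_t, std::string> replies_;
    std::optional<std::string> leader_hash_;
};

int main() {
    FastQuorumCheck q(/*f=*/1);                    // 3 replicas, fast quorum = 3
    q.add({0, /*is_leader=*/true,  "h1"});
    q.add({1, /*is_leader=*/false, "h1"});
    std::cout << q.add({2, false, "h1"}) << "\n";  // prints 1: fast commit
}
```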

24 of 45

Slow path

  1. DOM-S tags the deadline onto the request
  2. DOM-S multicasts the request to all replicas, where it is accepted by the Late Buffer
  3. The leader modifies the request’s deadline to make it eligible to enter the Early Buffer
  4. DOM-R releases the request from the leader’s Early Buffer
  5. The leader executes the request
  6. The leader replies to the client
  7. The leader broadcasts its log indices to the followers
  8. Followers modify their logs to stay consistent with the leader
  9. Followers fetch missing logs from their Late Buffers (see the sketch below)
  10. Followers reply to the proxy for the quorum check

[Figure: slow path. Requests land in the replicas’ Late Buffers; the leader re-stamps deadlines, appends to its log, executes, and broadcasts its log indices; the followers repair their logs from their Late Buffers and reply to the proxy for the quorum check before replies go back to the clients.]
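A sketch of the follower-side repair in steps 8-9, under simplifying assumptions (the leader ships its full ordered log rather than just indices; `Follower`, `Entry`, and the values are illustrative, not the project's API): the follower rebuilds its log in the leader’s order, pulling entries it is missing from its Late Buffer.

```cpp
// Slow-path log repair sketch: a follower rebuilds its log in the order
// dictated by the leader, fetching entries it is missing from its late buffer.
#include <cstdint>
#include <iostream>
#include <map>
#include <utility>
#include <vector>

struct Entry { uint64_t deadline_us; uint64_t request_id; };

class Follower {
  public:
    void hold_late(Entry e) { late_buffer_[e.request_id] = e; }   // missed its deadline
    void append(Entry e)    { log_.push_back(e); }                // released by DOM-R

    // Apply the leader's ordering. Returns false if some entry is neither in the
    // log nor in the late buffer (the full protocol would then fetch it from the leader).
    bool sync_with_leader(const std::vector<Entry>& leader_log) {
        std::map<uint64_t, Entry> known;
        for (const auto& e : log_) known[e.request_id] = e;
        for (const auto& kv : late_buffer_) known[kv.first] = kv.second;
        std::vector<Entry> rebuilt;
        for (const auto& e : leader_log) {
            auto it = known.find(e.request_id);
            if (it == known.end()) return false;
            Entry fixed = it->second;
            fixed.deadline_us = e.deadline_us;   // adopt the leader's (re-stamped) deadline
            rebuilt.push_back(fixed);
        }
        log_ = std::move(rebuilt);
        return true;
    }

    const std::vector<Entry>& log() const { return log_; }

  private:
    std::vector<Entry> log_;
    std::map<uint64_t, Entry> late_buffer_;
};

int main() {
    Follower f;
    f.append({10, 1});      // request 1 was released on time
    f.hold_late({12, 2});   // request 2 arrived after its deadline
    bool ok = f.sync_with_leader({{10, 1}, {13, 2}});   // leader re-stamped request 2
    std::cout << ok << " log size " << f.log().size() << "\n";   // prints: 1 log size 2
}
```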

25 of 45

Related Work

  • Spanner and CockroachDB: Rely on clocks for correctness by assuming worst-case error bounds; hard to guarantee such worst-case bounds with software clock synchronization

  • Granola and CLOCC: Timestamp based on transaction arrival, rather than a deadline.


26 of 45

Why consensus is still slow: the tax of coordination

  • Mis-ordering of events: Consensus aims to consistently order events across multiple servers

[Figure: Client-1 and Client-2 send requests to Replica-1, Replica-2, and Replica-3, each maintaining a log list]

Consensus: replicate the same requests across multiple servers


28 of 45

Reducing the need for coordination

  • Mis-ordering of events: Consensus aims to consistently order events across servers

Mis-ordering happens because of message reordering in the network 🡪 reduce message reordering

Prior work uses highly-engineered networks:
    • Equalize the transmission path: Speculative Paxos, NetPaxos, etc.
    • In-network serialization: NOPaxos, NetChain, Hydra, etc.

  • What about the public cloud?
    1. The tenant cannot engineer the network
    2. The reordering problem is quite bad

29 of 45

Quantifying reordering in the public cloud (Google Cloud)

2 n1-standard-16 receivers + 2 n1-standard-4 senders:

Per-Sender Multicast Rate:   20     40     60     80     100
Reordering Score:            13.9   28.2   28.5   30.0   33.0

2 n1-standard-16 receivers + multiple n1-standard-4 senders (each multicasting at 10K requests/sec):

Number of Senders:           2      4      6      8      10
Reordering Score:            7.3    23.6   31.3   39.5   42.7

Serious message reordering in the public cloud 🡪 much room to improve protocol performance

30 of 45

Nezha: High-performance and cloud native consensus

  • Reduce message reordering in the public cloud to accelerate protocols 

🡪 Use synchronized clocks through a new multicast primitive: deadline-ordered multicast

  • Why can synchronized clocks help?
    • They “equalize” the delay between senders and receivers by adding end-host delays
    • Wait a little to avoid waiting even more to coordinate.

[Figure: protocols plotted by performance vs. required network support; Paxos/Raft need no network support but have low performance, SpecPaxos/NetPaxos and NOPaxos/NetChain/Hydra gain performance at the cost of network support, while Nezha targets high performance without network support]

31 of 45

How are deadlines computed?

  • Synchronize clocks on the sender and all receivers.
  • Measure sender-to-receiver one-way delays at all receivers.
  • Piggyback the measurements back to the sender.
  • The sender computes the xth percentile one-way delay for each receiver over a sliding window of measurement samples.
  • It then picks the maximum of these one-way delays across all receivers to set the deadline.
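A compact sketch of the sliding-window estimation described above; the class name and window size are illustrative, not the project's API.

```cpp
// Sliding-window OWD estimator: keep the last W one-way-delay samples per
// receiver and combine the xth-percentile estimates into a deadline.
#include <algorithm>
#include <cstdint>
#include <deque>
#include <iostream>
#include <map>
#include <vector>

class OwdEstimator {
  public:
    explicit OwdEstimator(size_t window) : window_(window) {}

    // Record one probe measurement (piggybacked back from a receiver).
    void add_sample(uint32_t receiver_id, uint64_t owd_us) {
        auto& w = samples_[receiver_id];
        w.push_back(owd_us);
        if (w.size() > window_) w.pop_front();
    }

    // xth percentile of the current window for one receiver.
    uint64_t percentile(uint32_t receiver_id, double x) const {
        std::vector<uint64_t> v(samples_.at(receiver_id).begin(),
                                samples_.at(receiver_id).end());
        std::sort(v.begin(), v.end());
        return v[static_cast<size_t>(x / 100.0 * (v.size() - 1))];
    }

    // Deadline = send time + max percentile OWD across all receivers.
    uint64_t deadline(uint64_t send_ts_us, double x) const {
        uint64_t max_owd = 0;
        for (const auto& kv : samples_)
            max_owd = std::max(max_owd, percentile(kv.first, x));
        return send_ts_us + max_owd;
    }

  private:
    size_t window_;
    std::map<uint32_t, std::deque<uint64_t>> samples_;
};

int main() {
    OwdEstimator est(/*window=*/100);
    est.add_sample(1, 120); est.add_sample(1, 130); est.add_sample(1, 500);
    est.add_sample(2, 90);  est.add_sample(2, 95);  est.add_sample(2, 800);
    std::cout << est.deadline(/*send_ts_us=*/0, /*x=*/50.0) << "\n";  // prints 130
}
```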

32 of 45

The Nezha consensus protocol

  • Nezha employs stateless proxies to do (1) DOM multicast and (2) the quorum check
  • Proxies (1) remove a heavy burden from the leader/client, (2) enjoy much better scalability, and (3) make Nezha a direct replacement for Raft/Multi-Paxos
  • Install DOM:
    • Proxy 🡪 DOM sender
    • Replica 🡪 DOM receiver
  • Only the leader executes requests; followers only maintain a log list 🡪 easy to fix inconsistency between leader and followers

[Figure: clients send requests to proxies (DOM senders), which multicast them to the leader and followers (DOM receivers)]
33 of 45

Scalability with Replicas

[Figure: throughput vs. number of replicas for the closed-loop test and the open-loop test; labels: 150 clients, 10 clients, bottleneck at clients]

34 of 45

Proxy Evaluation

    • The proxies enable a single client to maintain high throughput
    • The proxies reduce CPU costs for clients
    • The proxies also provide a latency benefit under high throughput load

[Figures: measured with 9 replicas]

35 of 45


Comparison in WAN

Latency: 1.51x-3.37x; Throughput: 2.55x-10.13x

WAN setting further demonstrates Nezha’s advantage 🡪 1 WAN RTT (optimal) latency

36 of 45

Application (Redis)

  • 20 closed-loop clients submit requests under a 10 ms SLO
  • YCSB-A workload (HMSET/HGETALL on 1000 keys)
  • Nezha’s performance is within 6% of the unreplicated system

37 of 45

Application (CloudEx)

  • 48 open-loop traders + 16 gateways + 1 matching engine
  • Among the four protocols, only Nezha saturates the matching engine’s processing capacity (~43K orders/sec)
  • Nezha prolongs end-to-end latency by 19.7% but achieves very similar order-processing latency

38 of 45

  • We kill the leader and measure
    • How long does it take for the remaining replicas to complete a view change with the new leader elected?

Failure Recovery

[Figure: view-change duration after killing the leader (150 ms-300 ms)]

39 of 45

  • We kill the leader and measure
    • How long does it take for the remaining replicas to complete a view change with the new leader elected?
    • How long does it take to recover the throughput to the same level before crash?

Failure Recovery

[Figure: throughput recovery after killing the leader, at offered loads of 20K, 100K, and 200K reqs/sec]

40 of 45

Prior Works: Raft/Multi-Paxos and NOPaxos

[Figure: Raft/Multi-Paxos — clients send requests to the leader, which replicates to the followers; the leader’s role is heavy]

41 of 45

Prior Works

[Figure: Raft/Multi-Paxos (heavy leader) vs. NOPaxos (clients run libnopaxos and a quorum check, and requests go through an in-network sequencer in a highly engineered network; the approach is client-intrusive and heavy)]

42 of 45

Drawbacks of Prior Works

  • Our goal:
    • Both high performance and easy deployment 🡪 use DOM to achieve this goal

[Figure: protocols plotted by performance vs. required network support (Paxos/Raft, SpecPaxos/NetPaxos, NOPaxos, NetChain, Hydra); Nezha targets high performance without special network support]

43 of 45

DOM: Reduce Reordering

[Figure: Sender-1 sets deadline ddl=10 and Sender-2 sets ddl=12; each receiver holds an early-arriving message until its deadline, so Receiver-1 and Receiver-2 both release the messages at t=10 and t=12 in the same order]

44 of 45

DOM: Reduce Reordering in a Best-Effort Manner

[Figure: Sender-1 sets deadline ddl=10 and Sender-2 sets ddl=12, but one message arrives at a receiver after its deadline (“too late”); DOM cannot rectify such reordering cases]

45 of 45

DOM deadline estimation

  • Use probe packets to estimate the one-way delay (OWD) between the sender and the receivers.
  • The sender collects many OWD samples and uses a sliding-window approach to calculate the estimated OWD.
  • When S multicasts to R1-Rn: larger percentile 🡪 larger deadline

[Figure: sender S and receivers R1…Rn exchanging probes]