1 of 28

ReCraft: Self-Contained Split, Merge, and Membership Change of Raft Protocol

Kezhi Xiong¹, Soonwon Moon², Joshua H. Kang¹, Bryant Curto¹, Jieung Kim³, Ji-Yong Shin¹

¹Northeastern University, ²Seoul National University, ³Yonsei University

The 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, June 26, 2025

2 of 28

Reconfiguration

  1. Changing member nodes of a distributed system
  2. Changing the organization (e.g., for sharding) of a distributed system

Crucial for liveness and performance of a distributed system

3 of 28

Reconfiguration for distributed systems

  • Configuration change needs to be done consistently

Consensus-based services (e.g., ZooKeeper, etcd)

  • Highly available
  • Strongly consistent

[Figure: applications (KV Store, AI/ML, Data Analytics), each with its own Config]

4 of 28

Reconfiguration of consensus-based systems?

  • Reliance on external services (a consensus-based system relying on another consensus-based system)

[Figure: consensus-based services; a consensus-based system and its configuration manager, which is in turn a consensus-based system with its own configuration manager]

5 of 28

Reconfiguration of consensus-based systems?

  • Self-contained approach

    • Non-stopping approach: reconfigure while the system is running (Raft and ReCraft approach)
    • Stopping approach: stop regular ops -> reconfigure -> resume (bad for performance)

[Figure: a consensus-based system in which consensus decides the new configuration and the new configuration decides how consensus works; the state machine replication log holds Initial Config, Update, and New Config entries]

6 of 28

ReCraft: Self-Contained Split, Merge, and Membership Change of Raft Protocol

❶ Why Raft? Most popular consensus protocol

❷ Why new membership change?

    • Design from new insight [Honoré et al., PLDI 2022]

❸ Why new split and merge protocols?

    • No self-contained, non-stopping scheme
    • For better fault-tolerance, performance, and operability

Comparison of reconfiguration support:

  • Raft
    • Membership Change: Non-stopping
    • Splitting/Merging: N/A
  • Multi-Raft (TiKV, CockroachDB)
    • Membership Change: N/A
    • Splitting/Merging: External cluster manager; Stopping
  • ReCraft
    • Membership Change: Non-stopping
    • Splitting/Merging: Self-contained; Non/Minimal-stopping
    • Fault-tolerant & performant; Easier-to-implement

7 of 28

ReCraft

  • Introduction

  • Raft/ReCraft Assumptions and a Brief Intro to Membership Change (please refer to the paper for more details)

  • ReCraft Split and Merge Protocols

  • Evaluation

  • Conclusion

8 of 28

ReCraft and Raft assumptions and conditions

  • Inherits Raft’s assumptions
    • Asynchronous network with non-Byzantine failures
    • Each cluster of N nodes with a quorum size Q can tolerate up to f = N – Q node failures
    • Q is usually a majority, but can be larger during reconfiguration

  • Inherits Raft’s preconditions for reconfiguration
    1. All prior reconfigurations in the leader’s log must be committed [originally proposed in Ongaro’s PhD thesis, 2014]
    2. Consecutive configurations should always maintain a quorum overlap [relaxed by Honoré et al., PLDI 2022]
    3. The leader must commit a log entry in its term before starting a reconfiguration [added by Ongaro as a bug fix, 2015]
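The relation f = N – Q can be checked with a few lines of Python. This is only a minimal sketch: the helper names `majority` and `max_failures` are illustrative and do not come from the paper or from etcd.

```python
def majority(n: int) -> int:
    """Smallest quorum size for which any two quorums of n nodes overlap."""
    return n // 2 + 1


def max_failures(n: int, q: int) -> int:
    """Failures tolerated by a cluster of n nodes with quorum size q (f = N - Q)."""
    if not 1 <= q <= n:
        raise ValueError("quorum size must be between 1 and n")
    return n - q


for n in (3, 5, 6):
    q = majority(n)
    print(f"N={n}: Q={q}, f={max_failures(n, q)}")

# A quorum larger than a majority, as may be used during reconfiguration,
# tolerates fewer failures for the same cluster size.
print(max_failures(5, 4))
```

The last line illustrates why a larger Q during reconfiguration temporarily lowers the number of tolerated failures.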

9 of 28

Raft Revisited

Example cluster {A, B, C}:

  • Total # of nodes = 2f+1 = 3
  • Quorum size = f+1 = 2
  • Max # of failures = f = 1

[Figure: leader election (the leader must have an up-to-date log) and log replication across nodes A, B, and C, with vote requests for terms #1 and #2 and log entries α, β, and γ]

  • Committed due to quorum overlap: all possible majority sets, {A, B}, {A, C}, {B, C}, and {A, B, C}, contain “β”
  • New leader carries β
  • All decisions require the agreement of a quorum
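The quorum-overlap argument for the 3-node example can be checked by enumeration. The sketch below is illustrative only; the helper names `quorums` and `always_overlap` are assumptions made for this example.

```python
from itertools import combinations

NODES = ["A", "B", "C"]


def quorums(members, size):
    """Every subset of `members` that has at least `size` nodes."""
    return [
        set(group)
        for k in range(size, len(members) + 1)
        for group in combinations(members, k)
    ]


def always_overlap(groups):
    """True if every pair of groups shares at least one node."""
    return all(a & b for a in groups for b in groups)


majorities = quorums(NODES, 2)
print([sorted(q) for q in majorities])     # the majority sets of {A, B, C}
print(always_overlap(majorities))          # any two majorities share a node
print(always_overlap(quorums(NODES, 1)))   # single-node "quorums" do not
```

Because an entry is committed on one quorum and a leader is elected by another, the shared node is what lets the new leader carry the committed entry.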

10 of 28

Raft Membership Change

  • Old Config: {A, B, C}
  • New Config (NC): {A, B, C, D, E}
  • Joint Config (JC): {A, B, C} and {A, B, C, D, E}

Raft:

  1. Activates TWO quorums (JC): checks for quorum Q-old (2 of {A, B, C}) and for quorum Q-new (3 of {A, B, C, D, E})
  2. Adjusts to Q-new only (NC)

[Figure: the JC entry and then the NC entry being replicated to nodes A-E]
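A minimal sketch of the two checks in Raft's joint configuration, using the node sets from this slide. The function names are hypothetical and this is not the etcd API.

```python
OLD = {"A", "B", "C"}
NEW = {"A", "B", "C", "D", "E"}


def has_majority(acks, members):
    """True if `acks` contains a majority of `members`."""
    return len(acks & members) > len(members) // 2


def joint_quorum(acks):
    """Step 1 (JC): a decision needs both Q-old and Q-new."""
    return has_majority(acks, OLD) and has_majority(acks, NEW)


def new_quorum(acks):
    """Step 2 (NC): a decision needs Q-new only."""
    return has_majority(acks, NEW)


for acks in ({"A", "B", "D"}, {"A", "D", "E"}, {"A", "B"}):
    print(sorted(acks), joint_quorum(acks), new_quorum(acks))
```

The set {A, D, E} is a quorum under NC but not under JC, because it contains only one node of the old configuration.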

11 of 28

Raft vs ReCraft Membership Change (see Paper for details)

  • Old Config: {A, B, C}
  • New Config (NC): {A, B, C, D, E}
  • Interim Config (N+): {A, B, C, D, E}

Raft:

  1. Activates TWO quorums (JC): checks for Q-old and for Q-new
  2. Adjusts to Q-new only (NC)

ReCraft:

  1. Activates ONE Q-new+ quorum (N+): checks for quorum Q-new+, which is 4 of {A, B, C, D, E}
  2. Adjusts to Q-new (NC), conditionally

  • Q-new+ is the minimum quorum size that overlaps with Q-old. Naturally, Q-new+ ≥ Q-new
  • When Q-new+ = Q-new, ReCraft can omit one consensus step (e.g., adding 2 nodes to a 2-node cluster)

[Figure: the N+ entry and then the NC entry being replicated to nodes A-E]
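One way to read "minimum overlapping quorum size" is as a brute-force search: the smallest size, at least a majority of the new configuration, such that every group of that size intersects every old quorum. The sketch below follows that reading; it is an assumption, and the paper's exact definition may differ.

```python
from itertools import combinations


def majority(n):
    return n // 2 + 1


def overlaps_every_old_quorum(size, old, new, q_old):
    """True if every `size`-node subset of `new` meets every old quorum."""
    old_quorums = [set(c) for c in combinations(sorted(old), q_old)]
    return all(
        set(group) & quorum
        for group in combinations(sorted(new), size)
        for quorum in old_quorums
    )


def q_new_plus(old, new, q_old):
    """Smallest quorum size >= majority(new) that overlaps with Q-old."""
    for size in range(majority(len(new)), len(new) + 1):
        if overlaps_every_old_quorum(size, old, new, q_old):
            return size
    raise ValueError("no quorum size overlaps with every old quorum")


print(q_new_plus(set("ABC"), set("ABCDE"), 2))  # 3-node to 5-node example
print(q_new_plus(set("AB"), set("ABCD"), 2))    # 2-node to 4-node example
```

Under this reading the search reproduces both examples in the deck: 4 of {A, B, C, D, E}, and 3 of {A, B, C, D}, where Q-new+ equals Q-new.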

12 of 28

ReCraft

  • Introduction

  • Raft and ReCraft Assumptions and Membership Change (please refer to the paper for more details)

  • ReCraft Split and Merge Protocols

  • Evaluation

  • Conclusion

13 of 28

ReCraft Split Based on Quorum Overlap

[Figure: nodes A-E going from Conf-old, through Conf-joint, to Conf-new-1 and Conf-new-2]

  • Conf-old
    • Range: α – ω
    • Q-old: 3 out of A-E

EnterJoint (EJ)

  • Conf-joint
    • Range: α – μ
    • Q-joint: Q-new-1 + Q-new-2

LeaveJoint (LJ)

  • Conf-new-1
    • Range: α – μ
    • Q-new-1: 2 out of A-C

  • Conf-new-2
    • Range: ν - ω
    • Q-new-2: 2 out of D-E
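The quorums of this split example can be written as simple predicates over the set of acknowledging nodes. This is an illustrative sketch with hypothetical function names, using the quorum sizes listed on this slide.

```python
from itertools import combinations

OLD = set("ABCDE")   # Conf-old, Q-old: 3 out of A-E
SUB1 = set("ABC")    # Conf-new-1, Q-new-1: 2 out of A-C
SUB2 = set("DE")     # Conf-new-2, Q-new-2: 2 out of D-E


def q_old(acks):
    return len(acks & OLD) >= 3


def q_new_1(acks):
    return len(acks & SUB1) >= 2


def q_new_2(acks):
    return len(acks & SUB2) >= 2


def q_joint(acks):
    """Q-joint = Q-new-1 + Q-new-2: a quorum in both subclusters."""
    return q_new_1(acks) and q_new_2(acks)


subsets = [set(c) for k in range(len(OLD) + 1)
           for c in combinations(sorted(OLD), k)]
joint_quorums = [s for s in subsets if q_joint(s)]

print(min(len(s) for s in joint_quorums))     # smallest Q-joint in this example
print(all(q_old(s) for s in joint_quorums))   # is every Q-joint also a Q-old?
print(q_old(set("ABC")), q_joint(set("ABC")))
```

In this example a joint quorum needs at least four nodes, so it is strictly larger than Q-old, and every joint quorum also satisfies Q-old.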

 

14 of 28

ReCraft Split

[Figure: logs of nodes A-E as the EnterJoint (EJ) and LeaveJoint (LJ) entries are replicated and the cluster splits into {A, B, C} and {D, E}]

  • Conf-old (Epoch: 1, range α – ω)
    • Election: Q-old
    • Commit: Q-old

  • EnterJoint (EJ), when received
    • Election: Q-joint
    • Commit: Q-old

  • LeaveJoint (LJ), when received
    • Election: Q-joint
    • Commit: Q-new-1 in {A, B, C}, Q-new-2 in {D, E}

  • LeaveJoint (LJ), when committed: Split done
    • Election: Q-new-1 in {A, B, C}, Q-new-2 in {D, E}
    • Commit: Q-new-1 in {A, B, C}, Q-new-2 in {D, E}
    • Epoch: 2 in both clusters, with ranges α – μ and ν - ω

  • Epoch: prefix of the term #; incremented when Split/Merge succeeds
  • Allows non-stopping updates to the new config
  • Commit with a smaller # of messages than with Q-joint, since Q-joint > Q-old
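The phase-dependent quorum rules of the split can be summarized as a small per-node state machine. This is a simplified sketch, not the etcd implementation: the phase names, the `Node` class, and the placement of the epoch increment at LJ commit are assumptions drawn from the slide, and "Q-new" stands for the node's own subcluster quorum (Q-new-1 or Q-new-2).

```python
from dataclasses import dataclass

# phase -> (election quorum, commit quorum)
RULES = {
    "conf-old": ("Q-old", "Q-old"),
    "ej-received": ("Q-joint", "Q-old"),
    "lj-received": ("Q-joint", "Q-new"),
    "lj-committed": ("Q-new", "Q-new"),
}


@dataclass
class Node:
    phase: str = "conf-old"
    epoch: int = 1

    def on_receive(self, entry: str) -> None:
        """EJ and LJ change the quorum rules as soon as they are received."""
        if entry == "EJ" and self.phase == "conf-old":
            self.phase = "ej-received"
        elif entry == "LJ" and self.phase == "ej-received":
            self.phase = "lj-received"

    def on_commit(self, entry: str) -> None:
        """Committing LJ completes the split and increments the epoch."""
        if entry == "LJ" and self.phase == "lj-received":
            self.phase = "lj-committed"
            self.epoch += 1

    def quorums(self):
        return RULES[self.phase]


node = Node()
for step, entry in (("receive", "EJ"), ("receive", "LJ"), ("commit", "LJ")):
    if step == "receive":
        node.on_receive(entry)
    else:
        node.on_commit(entry)
    print(step, entry, node.quorums(), node.epoch)
```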

15 of 28

What can go wrong?

[Figure: a split of nodes A-E in which D and E miss the EJ and LJ entries]

  • Conf-old (Epoch: 1, range α – ω): Election: Q-old, Commit: Q-old
  • EnterJoint reaches A, B, and C, but not D and E. When received: Election: Q-joint, Commit: Q-old
  • LeaveJoint reaches A, B, and C. When received: Election: Q-joint, Commit: Q-new-1. When committed: Election: Q-new-1, Commit: Q-new-1
  • A, B, and C move to Epoch: 2 (range α – μ), while D and E stay at Epoch: 1 (range α – ω): D and E can get stuck
  • If the epoch is larger, pull data up to LJ: D and E then hold EJ and LJ and move to Epoch: 2 (range ν - ω)
  • The epoch updates only when a Split or Merge is done, so it clearly marks a configuration change
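The recovery rule "if epoch is larger, pull data up to LJ" might look like the following. It is a sketch under stated assumptions: logs are plain lists, the local log is a prefix of the peer's log, the peer's log contains an "LJ" entry, and the function name `catch_up` is hypothetical.

```python
def catch_up(local_log, local_epoch, peer_log, peer_epoch):
    """Return the (log, epoch) of a lagging node after it contacts a peer.

    If the peer is in a larger epoch, the node pulls the entries it is
    missing, up to and including the peer's LeaveJoint ("LJ") entry, and
    adopts the peer's epoch. Entries after LJ belong to the peer's new
    cluster and are not pulled.
    """
    if peer_epoch <= local_epoch:
        return list(local_log), local_epoch
    cut = peer_log.index("LJ") + 1
    missing = peer_log[len(local_log):cut]
    return list(local_log) + list(missing), peer_epoch


# D is stuck in epoch 1 and contacts A, which completed the split.
log_d, epoch_d = catch_up(["α"], 1, ["α", "EJ", "LJ", "β"], 2)
print(log_d, epoch_d)
```

In the slide's scenario, D and E end up with the EJ and LJ entries and can then continue as their own cluster.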

16 of 28

ReCraft Merge: Lock-based-2PC over Consensus

[Figure: clusters {A, B, C} and {D, E, F} merging into one cluster {A, B, C, D, E, F}]

  • Before the merge
    • {A, B, C}: Epoch: 3, range α – μ
    • {D, E, F}: Epoch: 5, range ν - ω

  1. 2PC prepare: a Merge Prep (TX) entry is replicated in both clusters, and the answer is OK
  2. 2PC commit/abort: a Merge Commit (C) entry is replicated in both clusters
  3. Data exchange: Snapshot Exchange & Merge (the only blocking operation in ReCraft)

  • After the merge
    • {A, B, C, D, E, F}: Epoch: 6, range α – ω

17 of 28

What can go wrong?

[Figure: a merge of clusters {A, B, C} and {D, E, F} that is aborted]

  1. 2PC prepare: Merge Prep (TX) entries are replicated in both clusters, but the answer is NO
  2. 2PC commit/abort: a Merge Abort (A) entry is replicated in both clusters

  • The clusters stay as they were: {A, B, C} at Epoch: 3 with range α – μ, and {D, E, F} at Epoch: 5 with range ν - ω
  • The main cause of a “NO” answer is an ongoing transaction (our precondition #1)
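The commit and abort paths of the merge can be sketched as a two-phase commit over two clusters. This is a simplified illustration, not ReCraft's implementation: each cluster is reduced to one object, consensus inside a cluster is not modeled, and the class and field names are hypothetical. The rule that the merged epoch is the largest epoch plus one is an assumption that merely matches the example (3 and 5 become 6).

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    epoch: int
    pending_reconfig: bool = False   # an earlier reconfiguration is uncommitted
    locked: bool = False
    log: list = field(default_factory=list)

    def prepare(self) -> str:
        """Log Merge Prep (TX); vote NO if a reconfiguration is ongoing."""
        self.log.append("TX")
        if self.pending_reconfig:
            return "NO"
        self.locked = True
        return "OK"

    def finish(self, decision: str) -> None:
        """Log Merge Commit ("C") or Merge Abort ("A") and release the lock."""
        self.log.append(decision)
        self.locked = False


def merge(clusters) -> str:
    votes = [c.prepare() for c in clusters]                   # 2PC prepare
    decision = "C" if all(v == "OK" for v in votes) else "A"
    for c in clusters:                                        # 2PC commit/abort
        c.finish(decision)
    if decision == "C":
        new_epoch = max(c.epoch for c in clusters) + 1        # assumed rule
        for c in clusters:
            c.epoch = new_epoch
    return decision


left, right = Cluster("ABC", 3), Cluster("DEF", 5)
print(merge([left, right]), left.epoch, right.epoch)

busy = Cluster("ABC", 3, pending_reconfig=True)
other = Cluster("DEF", 5)
print(merge([busy, other]), busy.epoch, other.epoch)
```

The second call shows the abort path: a cluster with an ongoing reconfiguration answers NO, so both clusters log the abort and keep their epochs.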

18 of 28

Evaluation

  • Implementation
    • Modified Raft library and relevant code in etcd ver 3.5
    • 4K+ lines of Go code modification

  • Experimental setup
    • Public research cloud
    • 16 VMs each with 2 vCPUs and 8GB DRAM
    • Uniform random put workload

  • Baseline
    • Emulation of Multi-Raft design (TiKV and CockroachDB) on etcd

19 of 28

Adopted Multi-Raft baseline

  • Relies on external configuration manager
  • Stops extensively

[Figure: Split Procedures and Merge Procedures of the baseline, each driven by a Cluster Manager. Step labels: STOP, REMOVE & RESET, COPY DATA, RESET, ADD MEMBERS, RUN. Key ranges: α – ω, α – μ, ν - ω]

20 of 28

Split

[Figure: Duration of Performance Dip for Split, ReCraft vs. Multi-Raft; one 6-node cluster to two 3-node clusters; lower is better]

  • ReCraft does not need data transfer: always constant time

21 of 28

Merge

[Figure: Duration of Performance Dip for Merge, ReCraft vs. Multi-Raft; two 3-node clusters to one 6-node cluster; lower is better]

  • ReCraft blocks minimally and transfers data in parallel
  • Multi-Raft serially sends data through the cluster manager

22 of 28

Fault tolerance

Minimum # of node failures to completely stop the split/merge

Operation | ReCraft Phase 1 | ReCraft Phase 2 | ReCraft Phase 3 | Multi-Raft, Standalone Cluster Manager | Multi-Raft, Replicated Cluster Manager
----------|-----------------|-----------------|-----------------|----------------------------------------|---------------------------------------
Split     | f_old + 1       | N (f_sub + 1)   | -               | 1                                      | f_cm + 1
Merge     | f_sub + 1       | f_sub + 1       | f_sub + 1       | 1                                      | f_cm + 1

In the previous experiments:

  • Split, ReCraft: Phase 1: fail the 6-node cluster = 3; Phase 2: fail two 3-node clusters = 4
  • Merge, ReCraft: Phases 1 to 3: fail one 3-node cluster = 2 in each phase
  • Multi-Raft, standalone CM: fail the standalone CM = 1 (split and merge)
  • Multi-Raft, replicated CM (triple replication with Raft): fail one 3-node cluster = 2 (split and merge)

  • Multi-Raft requires running an extra CM module
  • ReCraft can serve clients under failures due to its non-stopping design
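The numbers for the experimental setup can be reproduced from f = N – Q with majority quorums, taking f_old as f of the 6-node cluster and f_sub and f_cm as f of a 3-node cluster. In this sketch, reading the "N" in the split's second phase as the number of subclusters (2 here) is an assumption, chosen because it matches "fail two 3-node clusters = 4".

```python
def tolerated_failures(n: int) -> int:
    """f = N - Q for a cluster of n nodes with a majority quorum."""
    return n - (n // 2 + 1)


f_old = tolerated_failures(6)   # original 6-node cluster
f_sub = tolerated_failures(3)   # each 3-node subcluster
f_cm = tolerated_failures(3)    # 3-node replicated cluster manager
n_sub = 2                       # number of subclusters (assumed meaning of N)

min_failures = {
    ("Split", "ReCraft phase 1"): f_old + 1,
    ("Split", "ReCraft phase 2"): n_sub * (f_sub + 1),
    ("Split", "Multi-Raft standalone CM"): 1,
    ("Split", "Multi-Raft replicated CM"): f_cm + 1,
    ("Merge", "ReCraft phases 1-3"): f_sub + 1,
    ("Merge", "Multi-Raft standalone CM"): 1,
    ("Merge", "Multi-Raft replicated CM"): f_cm + 1,
}

for (operation, case), failures in min_failures.items():
    print(f"{operation:5} {case:26} {failures}")
```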

23 of 28

Conclusion

  • ReCraft: reconfiguration protocol for Raft

  • Improved membership change protocol for Raft (details in the paper)

  • First self-contained split and merge protocol for Raft (can be ported to multi-Paxos as well)

  • etcd implementation shows effectiveness against Multi-Raft

24 of 28

Thank you! Q & A

More in the paper

  • Detailed ReCraft membership change
  • Handling other corner cases for split and merge
  • Formal proof sketch of correctness
  • Links to
    1. Artifact (Go code & Rocq proofs)
    2. Extended version of the paper

Ji-Yong Shin (j.shin@northeastern.edu; https://www.jiyongshin.info)

25 of 28

Extra: ReCraft Membership change vs Raft

  • # of required messages for reconfiguration

26 of 28

Extra: ReCraft Membership change vs Raft

  • # of consensus for reconfiguration

27 of 28

Fault tolerance

Minimum # of node failures to completely stop the split/merge

Operation | ReCraft Phase 1 | ReCraft Phase 2 | ReCraft Phase 3 | Multi-Raft, Standalone CM | Multi-Raft, Replicated CM
----------|-----------------|-----------------|-----------------|---------------------------|--------------------------
Split     | f_old + 1       | N (f_sub + 1)   | -               | 1                         | f_cm + 1
Merge     | f_sub + 1       | f_sub + 1       | f_sub + 1       | 1                         | f_cm + 1

  • f_old = f of the 6-node cluster
  • f_sub = f of a 3-node cluster
  • f_cm = f of a 3-node cluster

In the experiments:

  • Split, ReCraft: Phase 1: fail the 6-node cluster = 3; Phase 2: fail two 3-node clusters = 4
  • Merge, ReCraft: Phases 1 to 3: fail one 3-node cluster = 2 in each phase
  • Multi-Raft, standalone CM: fail the standalone CM = 1 (split and merge)
  • Multi-Raft, replicated CM (triple replication with Raft): fail one 3-node cluster = 2 (split and merge)

  • Multi-Raft requires maintaining an extra CM module
  • ReCraft can keep serving clients under partial failures due to its non-stopping design

28 of 28

Raft vs ReCraft Membership Change (see Paper for details)

  • Old Config: {A, B}
  • New Config (NC): {A, B, C, D}
  • Joint Config (JC): {A, B} and {A, B, C, D}

Raft:

  1. Activates Joint (JC): checks for quorum Q-old (2 of {A, B}) and for quorum Q-new (3 of {A, B, C, D})
  2. Deactivates Joint Mode (NC): checks for Q-new only

ReCraft:

  1. Activates Q-new+ (N+): checks for Q-new+, the min quorum size that overlaps with Q-old; here Q-new+ = 3 out of {A, B, C, D}
  2. Adjusts quorum to Q-new, if necessary; because Q-new+ = Q-new, the second step is omitted

  • In general, Q-new+ ≥ Q-new

[Figure: the JC and NC entries (Raft) and the N+ entry (ReCraft) being replicated to nodes A-D]