1 of 56

Multi Paxos

CSE 452 Spring 2026

2 of 56

Announcements

  • Pset 4, Lab 3 Design Doc due Monday (5/4)
  • Design doc workshop in section tomorrow
    • At a minimum, make sure you have already read the spec
  • Before writing Lab 3 code: look through the Debugging Guide
    • Go through the debugging steps for failing tests before asking TAs for assistance!
  • (For W credit) Lab 2 Design Doc revisions due Wednesday (5/6)
    • Review all comments on the design doc; page through it even if you received full credit
    • Please highlight the areas/bullets that you’ve changed

3 of 56

Part-Time Parliament

“My attempt at inserting some humor into the subject was a dismal failure. People who attended my lecture remembered Indiana Jones, but not the algorithm. People reading the paper apparently got so distracted by the Greek parable that they didn’t understand the algorithm. Among the people I sent the paper to, and who claimed to have read it, were Nancy Lynch, Vassos Hadzilacos, and Phil Bernstein. A couple of months later I emailed them the following question:

Can you implement a distributed database that can tolerate the failure of any number of its processes (possibly all of them) without losing consistency, and that will resume normal behavior when more than half the processes are again working properly?

None of them noticed any connection between this question and the Paxos algorithm.” – Leslie Lamport

4 of 56

Basic Paxos -> Multi-Paxos

[Diagram: In Simple (Basic) Paxos, every slot requires its own leader election: Leader Election → Slot 1, Leader Election → Slot 2, Leader Election → Slot 3. In Multi-Paxos, a single leader election is followed by Slot 1, Slot 2, Slot 3 under the same leader.]

Why? Per-slot leader election can cause contention!

5 of 56

Possible order for implementation of Paxos

  1. Ballots
  2. Log(s)
  3. Accept messages/timers (P2As, P2Bs, Decisions)
    1. You can start by hard-coding one of the servers as the leader
    2. Make sure executions happen in order (start from slot_out, don’t skip slots) and that clients can make progress
  4. Leader election/Prepare messages/timers (P1As, P1Bs)
    • Run leader election at system startup
    • Check whether the leader is alive
  5. Heartbeats
    • Catch-up mechanism
  6. Garbage collection

6 of 56

Executing RPCs: Stable Leader (P2)

Multi-Paxos Edition

7 of 56

[Diagram: three servers, each with an empty log (slots 1, 2, 3, 4, ...). Server 3 is the leader (leader ballot: 1.3, active: true) and tracks "Command:" and "Recv’d P2Bs:" per slot. Server 1 (acceptor ballot: 1.2) and Server 2 (acceptor ballot: 1.3) are not active. Slot status legend: Not started / Accepted / Chosen.]

8 of 56

[Diagram: Server 3 sends P2A {ballot: 1.3, slot number: 1, command: A} to Servers 1 and 2, and records A as accepted in its own slot 1 (recv’d P2Bs: {S3}).]

9 of 56

[Diagram: Server 3 also sends P2A {ballot: 1.3, slot number: 2, command: B}. Server 3’s log: slot 1 = A (P2Bs: {S3}) and slot 2 = B (P2Bs: {S3}), both accepted.]

10 of 56

[Diagram: Server 2 accepts B into slot 2 of its log and replies P2B {ballot: 1.3, slot number: 2} to Server 3.]

11 of 56

[Diagram: Server 3 records the P2B: slot 2 = B now has P2Bs {S2, S3}, a majority, so B is chosen on Server 3. Slot 1 = A is still only accepted (P2Bs: {S3}).]

Can server 3 execute B?

12 of 56

[Diagram: Server 3 re-sends P2A {ballot: 1.3, slot number: 1, command: A} for the still-unchosen slot 1.]

13 of 56

[Diagram: Server 2 accepts A into slot 1 of its log and replies P2B {ballot: 1.3, slot number: 1}.]

14 of 56

[Diagram: Server 3 records the P2B: slot 1 = A now has P2Bs {S2, S3}, so both slot 1 = A and slot 2 = B are chosen on Server 3 and can be executed in order.]

15 of 56

[Diagram: Server 3 sends Heartbeat {ballot: 1.3, chosen: {1 → A, 2 → B}} to Servers 1 and 2.]

16 of 56

[Diagram: After the heartbeat, Server 2 marks slot 1 = A and slot 2 = B as chosen in its log. Followers learn chosen commands this way.]

17 of 56

Propose phase implementation notes

  • Hardcoding a server as the leader can pass 20/26 tests
  • You should not send messages to yourself over the network
    • E.g., instead of sending a P1a to yourself, call the handleP1b method directly (with your own P1b)
  • Think about how you’ll retry P2a messages: one timer per slot, or one for the entire server
    • There are pros and cons to both

18 of 56

Leader Election

Multi-Paxos Edition

19 of 56

Overview

Leader election begins upon:

  • System initialization
  • The current leader timing out due to lack of heartbeats

P1a

  • The candidate sends a P1a to all other nodes
    • Other nodes promise not to accept anything with a lower ballot

P1b

  • The followers reply with a P1b to the candidate
    • From the P1bs, the candidate learns the previous state of the system

P2a

  • The newly elected leader sends out a P2a for each of its accepted slots

20 of 56

[Diagram: Server 3 is the leader (ballot: 1.3, active: true) and sends Heartbeat() to Servers 1 and 2. Server 3’s log: slot 1 = A, slot 2 = B, slot 3 = C. Server 2’s log: slot 1 = A, slot 3 = C (hole at slot 2). Server 1’s log is empty (ballot: 0.0).]

21 of 56

Server 1 detects server 3 timeout!

[Diagram: Server 3’s Heartbeat() messages stop arriving at Server 1. Server 3’s log: A, B, C in slots 1-3; Server 2’s log: A in slot 1, C in slot 3; Server 1’s log is empty (ballot: 0.0).]

22 of 56

[Diagram: Server 1 starts an election by sending P1a(2.1) to every server. All three servers adopt ballot 2.1, and no server is active.]

23 of 56

[Diagram: The servers reply with P1bs listing their accepted entries: P1b([]) for Server 1’s empty log, P1b([<1.3, 1, A>, <1.3, 3, C>]) from Server 2, and P1b([<1.3, 1, A>, <1.3, 2, B>, <1.3, 3, C>]) from Server 3.]

24 of 56

[Diagram: Server 1 now holds the P1bs from all three servers: P1b([]) from itself, P1b([<1.3, 1, A>, <1.3, 3, C>]) from Server 2, and P1b([<1.3, 1, A>, <1.3, 2, B>, <1.3, 3, C>]) from Server 3.]

What should the status of slot 1 be?

25 of 56

[Diagram: Server 1 becomes the active leader (ballot: 2.1, active: true) and merges the P1bs into its log: slot 1 = A, slot 2 = B, slot 3 = C. A and C were reported by a majority (Servers 2 and 3) at the same ballot, so they are chosen; B was reported only by Server 3, so it is merely accepted.]

26 of 56

[Diagram: The new leader re-proposes the not-yet-chosen slot by sending p2a(2.1, 2, B) to the other servers.]

27 of 56

Leader Notes

  • Timer values will matter. We need to give a server enough time to have a chance to win an election.
  • As soon as you get a P1a with a higher ballot, the server should back off before trying to become leader again.
  • Heartbeat messages are how the leader tells followers that it’s alive.

28 of 56

Let’s take a deeper look

29 of 56

“Log Merging” rules

For each slot in an incoming P1b message:

  • Keep the command with the highest ballot number
  • If we’ve received the same command with the same ballot number from a majority of the nodes, mark the command as chosen
  • If no command is found for a slot, put a No-Op command there instead

Warning: Do not try to merge directly into your log!
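The merge rules above can be sketched as follows. This is an illustrative sketch only: ballots are bare ints and commands are Strings for brevity, and `TempEntry`/`mergeP1b` are hypothetical names, not part of the lab framework. Note it merges into a temporary log, per the warning.

```java
import java.util.Map;

// Illustrative sketch of the P1b merge rules; not the lab's actual API.
public class LogMerge {
    public record P1bEntry(int ballot, int slot, String command) {}

    public static final class TempEntry {
        public int ballot;
        public String command;
        public int votes;       // how many acceptors reported this (ballot, command)
        public boolean chosen;

        TempEntry(int ballot, String command) {
            this.ballot = ballot;
            this.command = command;
            this.votes = 1;
        }
    }

    // Merge one acceptor's P1b entries into the candidate's temporary log.
    public static void mergeP1b(Map<Integer, TempEntry> temp,
                                Iterable<P1bEntry> p1b, int majority) {
        for (P1bEntry e : p1b) {
            TempEntry cur = temp.get(e.slot());
            if (cur == null || e.ballot() > cur.ballot) {
                // Keep the command with the highest ballot number.
                cur = new TempEntry(e.ballot(), e.command());
                temp.put(e.slot(), cur);
            } else if (e.ballot() == cur.ballot && e.command().equals(cur.command)) {
                cur.votes++; // same command, same ballot, from another node
            }
            if (cur.votes >= majority) {
                cur.chosen = true; // a majority reported it: mark it chosen
            }
        }
    }

    // After merging a majority of P1bs: any slot still empty below the
    // highest known slot gets a no-op so execution can proceed without gaps.
    public static void fillHoles(Map<Integer, TempEntry> temp, int first, int last) {
        for (int s = first; s <= last; s++) {
            temp.putIfAbsent(s, new TempEntry(0, "NO_OP"));
        }
    }
}
```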

30 of 56

[Diagram: A candidate with ballot 3.2 holds a temporary log: slot 1 = (1.3, 1, A), slot 3 = (1.3, 3, E). Two incoming P1b messages wait to be merged: P1b([(1.3, 1, A), (2.1, 2, C), (2.1, 4, D)]) and P1b([(1.3, 2, B)]). Slot status legend: Not started / Accepted / Chosen.]

31 of 56

[Diagram: (1.3, 2, B) is merged first: slot 2 of the temp log is empty, so the entry is stored there. Temp log: (1.3, 1, A), (1.3, 2, B), (1.3, 3, E).]

32 of 56

[Diagram: Temp log after the first merge: (1.3, 1, A), (1.3, 2, B), (1.3, 3, E). The remaining P1b ([(1.3, 1, A), (2.1, 2, C), (2.1, 4, D)]) is merged next; its (1.3, 1, A) matches the temp log’s slot 1 entry.]

33 of 56

[Diagram: (2.1, 2, C) carries a higher ballot than the temp log’s (1.3, 2, B), so slot 2 is overwritten with (2.1, 2, C).]

34 of 56

[Diagram: (2.1, 4, D) fills the empty slot 4. Temp log: (1.3, 1, A), (2.1, 2, C), (1.3, 3, E), (2.1, 4, D).]

35 of 56

Merge Temporary log into actual log!

[Diagram: The leader’s log (ballot: 3.2, with slot in/slot out pointers) becomes: slot 1 = (1.3, 1, A), slot 2 = (3.2, 2, C), slot 3 = (3.2, 3, E), slot 4 = (3.2, 4, D). The entries the leader will re-propose now carry its ballot 3.2.]

36 of 56

[Diagram: The leader’s log: slot 1 = (1.3, 1, A), slot 2 = (3.2, 2, C), slot 3 = (3.2, 3, E), slot 4 = (3.2, 4, D).]

Should we execute slot 1?

37 of 56

“Log Merging” rules w/ extra note

For each slot in an incoming P1b message:

  • Keep the command with the highest ballot number
  • If we’ve received the same command with the same ballot number from a majority of the nodes, mark the command as chosen
  • If no command is found for a slot, put a No-Op command there instead

38 of 56

Additional Notes on “Log Merging”

Several ways to implement:

  1. Merge P1bs into a temp log as they’re received (keeping track of the command with the largest ballot number for each slot)
    • Make sure to only count P1bs that correspond to the current election
  2. Store the P1bs, and only merge them together after receiving a majority
    • A temp log is not necessary in this case
  3. Other ways too!

39 of 56

Log Merging:

Dealing with Log “Holes”/Proposing no-ops

If we have in our log:

put(a, 1), ______, ______, append(a, 2)

Do we know those empty log slots are decided? How can we push these log slots to completion and execute? Why do we need no-ops?

Your implementation needs to be able to handle "holes" in the Paxos log. That is, at certain points, a server might see agreement being reached on a slot but not previous slots. Your implementation must still get to where it can execute the append, even if no more client requests arrive.
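The leader can push such holes to completion by proposing no-ops for them. A minimal sketch of finding the holes to fill, with hypothetical names (`holesToFill`, `NO_OP`) and Strings standing in for commands:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: when agreement is reached on a later slot but earlier slots are
// still empty, the leader proposes a no-op for each hole so execution can
// reach the later chosen command even if no more client requests arrive.
public class Holes {
    public static final String NO_OP = "NO_OP";

    // Return each empty slot in [slotOut, maxKnownSlot) that needs a no-op proposed.
    public static List<Integer> holesToFill(Map<Integer, String> log,
                                            int slotOut, int maxKnownSlot) {
        List<Integer> holes = new ArrayList<>();
        for (int s = slotOut; s < maxKnownSlot; s++) {
            if (!log.containsKey(s)) {
                holes.add(s); // propose NO_OP for this slot via a normal P2A
            }
        }
        return holes;
    }
}
```

With the slide’s example log {1 → put(a, 1), 4 → append(a, 2)}, the holes are slots 2 and 3.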

40 of 56

[Diagram: back to the stable-leader setup: Server 3 is the leader (ballot: 1.3, active: true); Server 1 (ballot: 1.2) and Server 2 (ballot: 1.3) are inactive; all logs are empty.]

41 of 56

[Diagram: Server 3 sends P2A {ballot: 1.3, slot number: 1, command: A} and records A as accepted in slot 1 (P2Bs: {S3}).]

42 of 56

[Diagram: Server 3 sends P2A {ballot: 1.3, slot number: 2, command: B} and records B as accepted in slot 2 (P2Bs: {S3}).]

43 of 56

[Diagram: Server 2 accepts B into slot 2 of its log and replies P2B {ballot: 1.3, slot number: 2}.]

44 of 56

[Diagram: The P2A for slot 1 never reaches Server 2 (marked X), so Server 2 has B in slot 2 but nothing in slot 1. Log hole at slot 1.]

45 of 56

Garbage Collection

  • Problem: the log (and thus memory usage) can grow indefinitely
  • Solution: garbage collect (delete) log entries that are no longer needed
  • Can discard log slots that everyone has already executed
  • Need a way to determine which slots the other nodes are done with
    • One solution: piggyback this information on heartbeat replies
      • Requires some state about what everyone knows
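A minimal sketch of this scheme, using the slides’ names firstNotExecuted (a follower’s slot_out) and firstNonCleared; the surrounding state and handlers are hypothetical, not from the lab framework:

```java
import java.util.Collection;
import java.util.Map;

// Sketch of heartbeat-piggybacked garbage collection: each server reports
// its firstNotExecuted slot; everything below the minimum across all
// servers is safe to delete, since everyone has already executed it.
public class GarbageCollect {
    // reported: the latest firstNotExecuted value from each server
    // (including the leader's own). Returns the firstNonCleared slot.
    public static int firstNonCleared(Collection<Integer> reported) {
        return reported.stream().mapToInt(Integer::intValue).min().orElse(1);
    }

    // Discard every log slot strictly below firstNonCleared.
    public static void clearBelow(Map<Integer, String> log, int firstNonCleared) {
        log.keySet().removeIf(slot -> slot < firstNonCleared);
    }
}
```

With the example on the next slide (servers executed through slots 4, 4, and 3), the minimum firstNotExecuted is 4, so slots 1-3 can be cleared.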

46 of 56

Garbage Collection: 3 servers in the system

Server 1 (Leader): latest executed slot: 4
Server 2: latest executed slot: 4
Server 3: latest executed slot: 3 (hasn’t heard about slot 4)

Can garbage collect slots 1-3.

47 of 56

[Diagram: a stable system: Server 3 is the leader (ballot: 1.3, active: true); the servers’ logs hold A in slot 1 and B in slot 2.]

48 of 56

[Diagram: On heartbeat replies, the followers report how far they have executed: one server replies firstNotExecuted: 3, another firstNotExecuted: 2 (firstNotExecuted == slot_out).]

49 of 56

[Diagram: The leader takes the minimum of the reported firstNotExecuted values and announces firstNonCleared: 2. Slots below 2 can be deleted: Server 3’s log now starts at slot 2, while Server 2 still holds slots 1-4.]

50 of 56

[Diagram: After garbage collection, every server’s log starts at slot 2; slot 1 has been discarded everywhere.]

51 of 56

In case we have time

52 of 56

Possible order for implementation of Paxos

  1. Ballots
  2. Log(s)
  3. Accept messages/timers (P2As, P2Bs, Decisions)
    1. You can start by hard-coding one of the servers as the leader
    2. Make sure executions happen in order (start from slot_out, don’t skip slots) and that clients can make progress
  4. Leader election/Prepare messages/timers (P1As, P1Bs)
    • Run leader election at system startup
    • Check whether the leader is alive
  5. Heartbeats
    • Catch-up mechanism
  6. Garbage collection

53 of 56

Recommendation: Ballots

Even though representing ballots as numbers like 1.1, 1.2, 1.3, 2.1, ... would work, we recommend making Ballot a class with:

  • a sequence/round number
  • an Address

E.g., (3, server1)

Then make it Comparable and define a compareTo.

We recommend using a single ballot on each server (not one per role).

If you define Ballot as an inner class, remember to make it static! Otherwise it captures a reference to the enclosing server, which can make garbage collection fail or make testing inefficient and cause you to fail tests you would otherwise pass. Also make sure it’s serializable and has equals and hashCode!
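A sketch of such a Ballot class (the address is a String here for simplicity; in the lab it would be your framework’s Address type):

```java
import java.io.Serializable;
import java.util.Objects;

// Sketch of the recommended Ballot class. If you nest it inside your
// server class, declare it static so it doesn't capture the server.
public final class Ballot implements Comparable<Ballot>, Serializable {
    private final int round;      // sequence/round number
    private final String address; // tie-breaker: the proposing server

    public Ballot(int round, String address) {
        this.round = round;
        this.address = address;
    }

    // Compare by round first, then break ties by address.
    @Override
    public int compareTo(Ballot other) {
        int c = Integer.compare(round, other.round);
        return c != 0 ? c : address.compareTo(other.address);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Ballot b && round == b.round && address.equals(b.address);
    }

    @Override
    public int hashCode() {
        return Objects.hash(round, address);
    }
}
```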

54 of 56

Recommendation: Roles

  • As the leader:
    • Act as both a proposer and an acceptor from Basic Paxos
    • Propose requests from clients
      • First, check whether the command has already been proposed, decided, or executed
    • Keep replicas up to date
    • Send heartbeats to the other servers so they know the leader is alive
      • Can include garbage collection information in these messages
  • As anyone else (not the leader):
    • Drop client requests
    • Act only as an acceptor, not a proposer
      • That is, until the server believes the leader died. Then it should start phase 1 (the scout phase)

55 of 56

Recommendation: Logs

For your log, store it as:

Map<Integer, LogEntry> log; // slot number -> log entry

where LogEntry contains:

  • the Ballot (so round/sequence number and address)
  • the Paxos log status
  • the Command

If you define LogEntry as an inner class, remember to make it static! Otherwise it captures a reference to the enclosing server, which can make garbage collection fail or make testing inefficient and cause you to fail tests you would otherwise pass. Also make sure it’s serializable and has equals and hashCode!
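A sketch of this layout. `Status` mirrors the slides’ Not started / Accepted / Chosen; the String fields stand in for your own Ballot and Command types, and the method names are illustrative:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Sketch of the recommended log layout; types/names are placeholders.
public class PaxosLog {
    public enum Status { EMPTY, ACCEPTED, CHOSEN }

    // Declare static when nested inside your server class!
    public static final class LogEntry implements Serializable {
        public final String ballot;  // stand-in for your Ballot class
        public final Status status;
        public final String command; // stand-in for your Command type

        public LogEntry(String ballot, Status status, String command) {
            this.ballot = ballot;
            this.status = status;
            this.command = command;
        }
        // Remember equals/hashCode on your real LogEntry.
    }

    // slot number -> log entry
    private final Map<Integer, LogEntry> log = new HashMap<>();

    public void put(int slot, LogEntry e) {
        log.put(slot, e);
    }

    // Slots with no entry yet report EMPTY (i.e., "Not started").
    public Status statusOf(int slot) {
        LogEntry e = log.get(slot);
        return e == null ? Status.EMPTY : e.status;
    }
}
```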

56 of 56

Some cases/questions to consider when writing your code

  • As a proposer, receiving a request that’s already in your log
  • When do holes come up in your log? When should you propose no-ops?
  • What happens when you receive a larger ballot in a message?
  • How does a server know who the leader is?
  • If you receive an accept message for a slot that you already know is chosen, what should you do?
  • Why do we execute the log in order?
  • What happens when a leader dies?