1 of 78

Section 6: Lab 3 (contd.)

CSE 452 Spring 2021

2 of 78

Basic Paxos -> Multi-Paxos

Last week:

  • We know how to reach consensus on a single value (basic Paxos!)

This week (and what we’re doing in lab 3):

  • Do Paxos, but instead of doing it once, do it multiple times: once for each index (or “slot”) in a log
  • Once a value (command) has been chosen, we can execute it.
    • Important: Execute the log in order, i.e. move from slot to slot in order
      • E.g. execute slot 1 first, then 2, then 3, …
      • You can resume executing from where you left off last time

3 of 78

[Diagram] The log: Slot 1, Slot 2, Slot 3, … with a separate instance of Paxos (“Paxos!”) deciding the value for each slot.

4 of 78

Problem: Contention

If many proposers are proposing values for the same slot at the same time, problems can occur

  • Acceptors may reject accept requests if the proposal number isn’t high enough
    → Proposers retry with a higher proposal number
    → Proposers keep proposing with higher and higher proposal numbers
    → No progress is made in the system

5–21 of 78

[Diagram slides: two proposers dueling over the same slot]

Setup: proposers p1 and p2; acceptors A1, A2, A3, each tracking (highest prepare #, highest accepted #, highest accepted value), initially (-1, -1, null).

  • p1 sends P(1) to A1 and A3; both promise with PR(-1, null). p1 then sends A(1, v1); A1 accepts, recording (1, 1, v1), but the accept to A3 is delayed.
  • Meanwhile p2 sends P(2) to A2 and A3; both promise with PR(-1, null). p2 then sends A(2, v2); A2 accepts, recording (2, 2, v2), but the accept to A3 is delayed.
  • A3 has now promised proposal 2, so it rejects p1’s delayed A(1, v1). p1 retries with P(3); A1 and A3 promise (A1 reports its accepted (1, v1)), and p1 sends A(3, v1); A1 accepts, recording (3, 3, v1).
  • A3 has now promised proposal 3, so it rejects p2’s delayed A(2, v2); p2 will retry with an even higher proposal number, and so on. Neither value is ever chosen.

22 of 78

Problem: Contention

  • If many proposers are sending values for the same slot at the same time, problems can occur
    • Acceptors may reject accept requests if proposal number isn’t high enough
  • Don’t want to deal with contention for every log slot!
  • Solutions:
    • Random timeouts
    • Exponential backoff
    • Stable leader (Recommended) aka distinguished proposer

23 of 78

Stable Leader - Multi-Paxos (PMMC)

  • Why?
    • Prevents contention (generally)
    • Enables 1 round trip instead of 2 for most requests
  • How?
    • Elect a single server as leader
    • Use prepare phase to become leader!
    • Once leader, skip prepare phase, only run accept phase
  • Issues:
    • Multiple servers may think they are leader
    • Can still have contention among servers trying to become leader
  • PMMC describes an implementation of Multi-Paxos

24 of 78

Terminology (Basic Paxos -> PMMC)

  • Proposal number -> ballot/ballot number
    • Ballot: (round number, address)
  • Prepare request/response -> phase 1 (P1A, P1B, scouts)
  • Accept request/response -> phase 2 (P2A, P2B, commanders)

25 of 78

Recommendation: Ballots

Even though something like using ballots 1.1, 1.2, 1.3, 2.1, ... would work, we recommend making Ballot a class containing:

  • sequence/round number
  • Address

E.g. (3, server1)

Then make it Comparable and define a compareTo (a sketch follows below).

We recommend using a single ballot on each server (not for each role).

If you define the Ballot as an inner class, remember to make it static! Otherwise each instance keeps a reference to the enclosing server, which can prevent garbage collection, cause inefficiencies in testing, and make you fail tests you would otherwise pass. Also make sure it’s serializable and has equals and hashCode!
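
A minimal sketch of such a Ballot, assuming Lombok (as used elsewhere in the labs) and the framework’s comparable Address; the field names are illustrative:

// Sketch only. Nested inside PaxosServer, hence static (see the note above).
@Data
private static class Ballot implements Comparable<Ballot>, Serializable {
    private final int roundNumber;   // sequence/round number
    private final Address address;   // the server proposing with this ballot

    @Override
    public int compareTo(Ballot other) {
        // Compare round numbers first; break ties by server address.
        if (roundNumber != other.roundNumber) {
            return Integer.compare(roundNumber, other.roundNumber);
        }
        return address.compareTo(other.address);
    }
}

Lombok’s @Data supplies equals, hashCode, and a constructor for the final fields.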

26 of 78

PMMC vs. Lab 3

PMMC:
  • Split into roles (replicas, acceptors, leaders, etc…)
  • Multi-threaded
  • Handles reconfiguration
  • Retries leader election when preempted

Lab 3:
  • One class (PaxosServer)
    • Handles all the roles
    • Recommendation: Don’t use a separate object for each role*
  • Single-threaded on each Node
  • Ignores reconfiguration
  • Uses heartbeat messages to determine when to begin leader election

27 of 78

28 of 78

PMMC (Multi-Paxos) Properties

  • Active leader election (Phase 1) using a Scout (can be a separate function)
  • Follower nodes can drop client requests
    • Check if a request has already been executed; if so, return the result
  • Active leader will propose values (Phase 2)
    • Check if a request has already been proposed/decided/executed
  • Follower will start phase 1 if it suspects the active leader is dead
  • Rinse and repeat

29 of 78

Recommendation: Logs

For your log, store it as:

Map<Integer, LogEntry> log; // Where slot number -> log entry

Where LogEntry contains (a sketch follows below):

  • Ballot (so round number/sequence number and address)
  • Paxos Log Status
  • Command

If you define the LogEntry as an inner class, remember to make it static! Otherwise each instance keeps a reference to the enclosing server, which can prevent garbage collection, cause inefficiencies in testing, and make you fail tests you would otherwise pass. Also make sure it’s serializable and has equals and hashCode!
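
A minimal sketch of such a LogEntry, reusing the Ballot sketch from earlier and assuming the lab’s PaxosLogSlotStatus and Command types; everything else is an illustrative name:

// Sketch only. Nested inside PaxosServer, hence static (see the note above).
@Data
@AllArgsConstructor
private static class LogEntry implements Serializable {
    private Ballot ballot;               // ballot this command was accepted under
    private PaxosLogSlotStatus status;   // e.g. ACCEPTED, then CHOSEN once decided
    private Command command;             // the client command (or a no-op)
}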

30 of 78

Phase 1: Leader election

  • Goal: We want only one proposer at a time; we’ll call it the distinguished proposer or the leader (prevents contention, 1 RTT instead of 2)
  • Need a way to pick a leader (many different solutions work).
    • E.g. the server with the largest ballot number (using compareTo) among those pinged tries to become leader (starts phase 1)
    • Once a leader is elected, it can send a heartbeat message to every other server every T ms
    • If a server hasn’t received a heartbeat from the leader for an entire HeartbeatCheckTimer interval (similar to PingCheckTimer), it tries to become the new leader (a sketch follows this list)
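
A rough sketch of that liveness check; the timer, fields, and helper below are illustrative names, not part of the lab’s interface:

// Fires periodically; if no heartbeat arrived during the last interval, start phase 1.
private void onHeartbeatCheckTimer(HeartbeatCheckTimer t) {
    if (!leader && !receivedHeartbeatThisInterval) {
        startLeaderElection();   // e.g. bump the ballot’s round number and send P1As
    }
    receivedHeartbeatThisInterval = false;
    set(t, HEARTBEAT_CHECK_MILLIS);   // re-arm the periodic check
}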

31 of 78

Phase 1: How to get elected

  • Multiple nodes may attempt to become leader at once, can cause contention for the position
  • To become leader, send phase 1 (P1A) messages to all acceptors with your ballot number
  • If the ballot is higher than any ballot the acceptor has seen before, the acceptor should update its ballot number, then respond with a message containing its accepted values for all slots (a sketch of this handler follows the list)
    • Don’t try to become the leader again right away
    • Can include chosen values to avoid needing to go through consensus again
  • If the sender receives a majority of responses from acceptors with its ballot, it is now the leader
    • The sender can merge each log as it receives it, to avoid keeping them all around (Why?)
  • What happens if the sender doesn’t receive a majority of responses?
    • Keep trying until you get a message with a higher ballot number (or you get elected)
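
A rough sketch of the acceptor side of phase 1; the message and field names are illustrative, not prescribed by the lab:

// Promise the higher ballot (if any) and report our accepted/chosen slots.
private void handleP1A(P1A m, Address sender) {
    if (m.getBallot().compareTo(ballot) > 0) {
        ballot = m.getBallot();   // promise: never accept anything with a lower ballot
        leader = false;           // someone else is campaigning with a higher ballot
    }
    // Reply with our current ballot plus our log; the candidate only counts this
    // as a vote if the returned ballot matches its own.
    send(new P1B(ballot, log), sender);
}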

32 of 78

Phase 1: I’m elected, now what?

  • To preserve linearizability, the newly elected leader must update its state with the accepted values it received from acceptors
    • For each slot, take the value with the highest ballot (see the merge sketch below)
  • Send out phase 2 (P2A) messages for all values that have been accepted already
  • Can begin to send out phase 2 (P2A) messages as client requests arrive
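
One way to do that merge, sketched with the same illustrative names as before (and assuming P1Bs may also carry chosen slots, per the optimization mentioned earlier):

// Merge one P1B’s entries into the candidate’s own view of the log.
private void mergeLog(Map<Integer, LogEntry> theirs) {
    theirs.forEach((slot, theirEntry) -> {
        LogEntry mine = log.get(slot);
        boolean theirsWins = mine == null
                || theirEntry.getStatus() == PaxosLogSlotStatus.CHOSEN
                || (mine.getStatus() != PaxosLogSlotStatus.CHOSEN
                    && theirEntry.getBallot().compareTo(mine.getBallot()) > 0);
        if (theirsWins) {
            log.put(slot, theirEntry);   // keep chosen entries; otherwise highest ballot wins
        }
    });
}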

33–39 of 78

[Diagram slides: Server 3 becomes leader via phase 1]

Initial state: Server1 ballot 1.2, accepted {1->(A, 1.2)}; Server2 ballot 1.2, accepted {1->(A, 1.2)}; Server3 ballot 1.3, accepted {}.

  • Server 3 wants to become leader, so it sends P1A (ballot 1.3) to the other servers; the P1A to Server 2 is delayed.
  • Server 1 adopts ballot 1.3 and replies with P1B (ballot 1.3, accepted {1->(A, 1.2)}); Server 3 also counts its own P1B (ballot 1.3, accepted {}). Server 2 stays at ballot 1.2.
  • With a majority (itself and Server 1), Server 3 is now leader, with proposals {1->(A, 1.2)}.
  • Server 3 re-proposes the accepted value by sending P2A (ballot 1.3, slot 1, command A).

40 of 78

Phase 2: Leader

  • When a server that thinks it is leader receives a request, it can skip phase 1 and immediately send out phase 2 (P2A) messages
    • P2A message should contain (ballot, slot number, command)
  • In PMMC, upon receiving a P2A, an acceptor should only accept it if the ballot in the message matches the acceptor’s ballot (a sketch of this follows the list)
    • This means the acceptor considers the sender to be the current leader
    • Prevents split brain problems
    • You can always reply with a P2B, as long as you don’t update the state if ballots don’t match!
      • But you might not need to respond
  • Need a way to sync logs between leader and all the acceptors to learn about missed decisions.
    • They can also sync up with heartbeat messages.
    • Alternatively, when the leader sends an accept message, attach the leader’s log to the request; when the acceptor replies, attach the acceptor’s log to the response, then update/merge each side’s log. (Although you should get normal P2As working first.) Lots of different possible protocols and implementations to consider!
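
A rough sketch of the acceptor side of phase 2, using the same illustrative names (whether to reply on a ballot mismatch is up to your design):

// Accept only if the sender is the leader we currently recognize.
private void handleP2A(P2A m, Address sender) {
    boolean alreadyChosen = log.containsKey(m.getSlot())
            && log.get(m.getSlot()).getStatus() == PaxosLogSlotStatus.CHOSEN;
    if (m.getBallot().equals(ballot) && !alreadyChosen) {
        log.put(m.getSlot(),
                new LogEntry(m.getBallot(), PaxosLogSlotStatus.ACCEPTED, m.getCommand()));
    }
    // Reply with our current ballot either way (never updating state on a mismatch);
    // a preempted leader learns about the higher ballot from this P2B.
    send(new P2B(ballot, m.getSlot()), sender);
}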

41 of 78

Recommendation: Roles

  • As the leader:
    • Act as a proposer and acceptor from Basic Paxos
    • Propose requests from clients
      • First, check if the command has already been proposed, decided, or executed
    • Keep replicas up to date
    • Send heartbeats to servers so they know the leader is alive
      • Can include garbage collection information in these messages
  • As anyone else (not the leader):
    • Drop client requests
    • Act only as an acceptor, not a proposer
      • That is, until the server believes the leader died; then it should start phase 1 (the scout phase)

42 of 78

Garbage Collection

  • Problem: log (and thus memory usage) can grow indefinitely
  • Solution: garbage collect (delete) log entries that are no longer needed
  • Can discard log slots that everyone has already executed (see the sketch after this list)
  • Need a way to determine what slots other nodes are done with
    • One solution: piggyback this information on heartbeat replies
      • Needs some state about what everyone knows
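
A minimal sketch of that check; servers and latestExecuted are illustrative names for state you would track (e.g. learned from heartbeat replies):

// Discard every log slot that all servers have already executed.
private void garbageCollect() {
    int safeUpTo = Integer.MAX_VALUE;
    for (Address server : servers) {
        safeUpTo = Math.min(safeUpTo, latestExecuted.getOrDefault(server, 0));
    }
    final int bound = safeUpTo;
    log.keySet().removeIf(slot -> slot <= bound);   // everyone has executed these slots
}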

43 of 78

Garbage Collection: 3 servers in the system

  • Server1 (leader): latest executed slot: 4
  • Server2: latest executed slot: 4
  • Server3: latest executed slot: 3 (hasn’t heard about slot 4)

Can garbage collect slots 1-3

44 of 78

Garbage Collection: 5 servers in the system

  • Server1 (leader): latest executed slot: 4
  • Server2: latest executed slot: 4
  • Server3: latest executed slot: 3 (hasn’t heard about slot 4)
  • Server4: latest executed slot: N/A
  • Server5: latest executed slot: N/A

Cannot garbage collect anything until all servers have executed a slot

45 of 78

Recommendation: slot_in and slot_out

Keep track of slot_in and slot_out from PMMC:

  • slot_in: Where the leader/distinguished proposer puts new proposals
    • Make sure that you don’t repropose things that have:
      • Already been executed [app.alreadyExecuted]
      • Already been proposed [already exists in the log]
    • Acceptors/Replicas: Updated when learning about greater slots
  • slot_out: the first slot in the log that hasn’t been decided/executed
    • Should be monotonically increasing and shouldn’t jump around
    • Execution of slots should happen in order
    • When executing, execute as much as possible, stopping at the next unchosen slot (see the sketch below)
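
A minimal sketch of that execution loop; slotOut, log, app, and isNoOp are illustrative names, and the reply-sending details are omitted:

// Execute chosen commands strictly in slot order, resuming from slot_out.
private void executeLog() {
    while (log.containsKey(slotOut)
            && log.get(slotOut).getStatus() == PaxosLogSlotStatus.CHOSEN) {
        Command cmd = log.get(slotOut).getCommand();
        if (!isNoOp(cmd)) {
            Result result = app.execute(cmd);   // then reply to the waiting client
        }
        slotOut++;
    }
}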

46 of 78

Lab 3 Questions, q1

Question: What’s the difference between a ballot number and a log slot?�

Answer:

  • Ballots are tuples of (round number, leader) and are used during phase 1 of Multi-Paxos (analogous to proposal numbers from Basic Paxos).
  • Log slots are a separate concept – each log slot corresponds to one instance of Paxos.
    • Basic Paxos has no notion of log slots because it is only used to decide on a single value
    • A single ballot can be used to decide multiple log slots
      • This means the leader did not change

47 of 78

Lab 3 Questions, q2

Question: When are you done with a log slot? How can you make sure that they’re garbage-collected efficiently?�

Answer: You are done when all servers in the Paxos group have executed the request. You can piggyback this information on existing messages to let everyone know not only what you are done with, but also what you think everyone else is done with.

  • Note: Some log slots may not have been decided because the proposer may have died, stopping the push to consensus on that slot. You can resolve this by proposing for that slot yourself: this will either drive an already-accepted value to be chosen, or, if nothing has been accepted, choose a no-op for that slot.

48 of 78

Dealing with Log “Holes”/Proposing no-ops

If we have in our log:

put(a, 1), ______, ______, append(a, 2)

Do we know those empty log slots are decided? How can we push these log slots to completion and execute? Why do we need no-ops?

Your implementation needs to be able to handle "holes" in the Paxos log. That is, at certain points, a server might see agreement being reached on a slot but not previous slots. Your implementation should still make progress in this case.
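
One way a newly elected leader can push such holes to completion, as a sketch; slotOut, maxSlot, propose, and NoOpCommand are all illustrative names:

// After winning phase 1, fill every empty slot below the highest known slot with a no-op.
private void fillHoles() {
    // maxSlot = highest slot with an accepted or chosen entry.
    for (int slot = slotOut; slot < maxSlot; slot++) {
        if (!log.containsKey(slot)) {
            propose(slot, new NoOpCommand());   // send P2As proposing a no-op for this slot
        }
    }
}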

49–61 of 78

[Diagram slides: filling a log hole with a no-op]

Initial state: Server1 ballot 1.2, accepted {1->(A, 1.2), 2->(B, 1.2)}; Server2 ballot 1.2, accepted {2->(B, 1.2)}; Server3 ballot 1.3, not active, accepted {}, log {}. (In this example the log only contains chosen slots.)

  • Server 3 sends P1A (ballot 1.3). Server 2 adopts ballot 1.3 and replies with P1B (ballot 1.3, accepted {2->(B, 1.2)}); Server 3 also counts its own P1B (accepted {}). Server 1 never responds, so Server 3 never learns about the value A accepted in slot 1.
  • With a majority, Server 3 becomes the active leader with proposals {2->(B, 1.2)} and a hole at slot 1.
  • Server 3 sends P2A (ballot 1.3, slot 1, command no-op) to fill the hole, and P2A (ballot 1.3, slot 2, command B) to re-propose the accepted value.
  • Server 2 accepts slot 2 under ballot 1.3 and replies with P2B (ballot 1.3, slot 2); together with Server 3’s own acceptance that is a majority, so slot 2 is chosen: log {2->B}.
  • Server 2 then accepts the no-op for slot 1 and replies with P2B (ballot 1.3, slot 1); slot 1 is chosen as a no-op: log {1->no-op, 2->B}. (This is safe: A was accepted only by Server 1 and was never chosen.)

62 of 78

Example of merging logs (from P1Bs, with the optimization of also sending chosen slots).
Suppose there are 5 servers, and s4 is trying to become leader with ballot number (5, s4).
Each cell is ([C]hosen/[A]ccepted, (Ballot), value). Grey is executed.

Slot #          | 1         | 2               | 3 | 4               | 5               | 6
Candidate (s4)  | (C, _, v) | (A, (2, s4), h) |   | (A, (3, s2), x) |                 |
Follower 1 (s3) |           | (A, (2, s5), y) |   | (C, _, x)       | (A, (3, s5), b) |
Follower 2 (s1) | (C, _, v) | (C, _, y)       |   | (A, (3, s5), x) | (A, (4, s2), g) |

After merging, the new leader’s view is:

Slot #          | 1         | 2         | 3                              | 4         | 5                          | 6
Leader (s4)     | Chosen: v | Chosen: y | Propose: NO-OP, ballot (5, s4) | Chosen: x | Propose: g, ballot (5, s4) |

What was/is slot_in?

What was/is slot_out?

What should the leader do next?

How does s3 (+ other nodes) learn that slot 1 and 2 were chosen?

63 of 78

Cycling Leaders/Late Rejections*

Consider the case with 5 servers:

Server 4 is leader.

Server 4 sends P2A for value v in slot 1 with ballot number ⟨1, 4⟩ to every server. ← delayed

Server 5 sends P1A to every server with ballot ⟨1, 5⟩.

Servers 1 and 2 accept ballot ⟨1, 5⟩ and send P1B.

Server 5 then goes down.

Server 4 sees this and starts scouting, sending P1A with ballot ⟨2, 4⟩.

Servers 1 and 2 accept ballot ⟨2, 4⟩ and send P1B.

Server 4 is now the active leader.

Server 4 sends P2A for value v in slot 1 with ballot ⟨2, 4⟩ to every server.

Server 2 then receives the old P2A with value v in slot 1 and ballot ⟨1, 4⟩.

Server 2 rejects this P2A and then sends back a P2B with the ballot ⟨2, 4⟩ without accepting the P2A.

Server 4 gets the P2B with ballot ⟨2, 4⟩ and believes server 2 has accepted the new P2A, when it has not.

*=For people who follow PMMC REALLY closely

64 of 78

Possible order for implementation of Paxos

  1. Ballots
  2. Log(s)
  3. Accept messages/timers (P2As, P2Bs, Decisions)
    1. You can start by hard-coding one of the servers as the leader
    2. Make sure executions happen in order (starting from slot_out, without skipping slots) and that clients can make progress
  4. Leader election/prepare messages/timers (P1As, P1Bs)
    • Start leader election at the very start
    • Check if leader is alive
  5. Heartbeats
    • Catch-up mechanism
  6. Garbage collection

65 of 78

Some cases/questions to consider when writing your code

  • Receiving a request that’s already in your log as a proposer
  • When do holes come up in your log? When should you propose no-ops?
  • What happens when you receive a larger ballot in a message?
  • How does a server know who the leader is?
  • If you receive an accept message for a slot that you already know is chosen, what should you do?
  • Why do we execute the log in order?
  • What happens when a leader dies?

66 of 78

Misc. tips

  • Don’t use objects to separate the roles, as that might lead to unnecessary extra state if not done carefully
    • If you’re failing the search tests, run with --checks to make sure that you have @Data (and friends) where necessary
  • You don’t need WINDOW because we aren’t reconfiguring
  • You don’t need to re-propose alreadyExecuted commands
  • http://paxos.systems/ is easier to parse than PMMC
  • Try to get Lab 3 to work because Lab 4 is built on top of Lab 3
  • Try to understand how the changes you make impact the system so you aren’t scared of adding/changing/removing the code you write
  • Don’t put things back in slots that have already been cleared

67 of 78

Tips for debugging

  • If you finished prepares and accepts, you can debug them by using search test 20 and looking for invariant-violating states on the visualizer.
    • If you fail search tests due to too many states explored, try running the test with --checks and make sure that you have @Data or @EqualsAndHashCode for your classes
    • If you failed search tests due to not having a majority accept for a chosen command, but the visualizer says that you did get a majority accept, it’s possible you had an error in your interface methods (command(...) and status(...))
  • For the random search tests, you can change the random depths to shorter depths so that you can get easier traces to step through.
    • In ./labs/lab3-paxos/tst/dslabs/paxos/PaxosTest.java, change maxDepth(1000) to something like maxDepth(20)
    • Make sure your interface methods work, since the tests use those to check things, e.g. status, lastNonEmpty, etc.
    • Remember to change it back!
  • Inner classes must be static; otherwise each instance implicitly holds a reference to the parent class and drags it along when serialized. This also applies to lab 4; it can lead to random, weird timeouts/slowdowns.

68 of 78

If you fail test 8 and...

Client workload wasn’t finished

  • Phase 2 probably doesn’t work right

It’s due to a timeout

  • Make sure that the interface methods are fast (avoid iterating over the whole log if you can; this matters less once you garbage collect)

69 of 78

If you fail test 10 and it takes too long...

  • Make sure that you reply to the client right away when you execute a command, rather than waiting for the client to contact you again before replying

70 of 78

If you fail 16,17,18

  • Try debugging by saving the logging output to a file and looking at the regions where the client that the test says waited too long actually waited too long
  • Common issues:
    • Not doing no-ops
    • Not doing leader election properly
      • e.g. not allowing servers that previously tried to get elected to try again
      • Not re-proposing (or not setting timers to re-propose) accepted slots
    • Followers trying to become leader too fast
  • Maybe add print statements when you come to a decision and when you receive a response to get an idea of what’s causing the slowdown

71 of 78

Some other tips for debugging 17,18

You can add print statements and run tests 17 and 18 (without writing the output to a file) to see where it gets stuck.

Suggested positions:

  • When leaders get elected
  • When a command becomes chosen
  • When clients get the correct result back

72 of 78

If you fail 23 because...

Search space exhausted:

  • Make sure you’re properly resetting timers for things like leader elections, heartbeats, or p2as

Could not find state matching [...]:

  • Reduce the number of timers that you set/messages that you send

73 of 78

If you see servers that are supposed to be dead with timers triggering...

  • Check that you’re not setting timers exponentially, e.g. a timer handler that calls a method that sets a new timer and also resets the original timer

74 of 78

If you see that a handler isn’t idempotent

You can run the visualizer by itself and try to get to a state in which the message whose handler is reported as non-idempotent has been sent, then right-click the message and click “duplicate”. Then process both copies in order. Note that the network model is a set of messages, so handling a message a second time shouldn’t create a brand-new message/timer.

To run the visualizer by itself, do something like this:

  • ./run-tests.py -l 3 --debug 3 2 "PUT:foo:bar,APPEND:foo:baz,GET:foo"

75 of 78

Miscellaneous design things we saw that we thought were good ideas

  • Have HeartbeatCheckTimers be set with a ballot so that you can start and stop HeartbeatCheckTimers more easily, especially for preempted leaders/candidates
  • Keeping track of slot_in and slot_out on followers for lastNonEmpty and other interface methods
  • Just merge the logs as you receive them when you’re trying to become the leader, so that you don’t have to keep whole logs around

76 of 78

Some memes to remind you that your struggle was also experienced by students in previous quarters

77 of 78

New meme proposals

78 of 78

We are accepting more memes, plz & tks