1 of 50

Section 2: Lab 2

(Primary Backup)

CSE 452 Spring 2026

2 of 50

Announcements

  • Pset 1 & Lab 1 code, Design Doc due yesterday (04/08) [done individually]
  • Review syllabus for late day policy
    • 48 hr grace period for lab 1
    • 48 hr grace period for Pset 1
    • NO grace period for design doc
  • Lab 2: Work on only one partner’s repo and share it with the other partner
    • Design doc due next Friday (04/17), no grace period
  • Pset 2 also due next Friday

3 of 50

If you haven’t already...

Add your partner to the GitLab repo you agreed to work in by following these steps:

  • Go to the designated GitLab repo
  • Go to Members
  • Look up your partner’s NetID
  • Set their role to Maintainer/Developer
  • Click “Invite”

4 of 50

Recap 6 Rules:

  1. Primary must wait for backup to accept/execute each op before doing op and replying to client
  2. Backup must accept forwarded requests only if view is correct
  3. Primary in view i+1 must have been backup or primary in view i
  4. Non-primary must reject client requests
  5. ViewServer cannot move on to view i+1 until ack from primary in view i is received
  6. Every operation must be before or after state transfer (Problem with processing requests during a state transfer)

5 of 50

Why Primary-Backup?

6 of 50

Want to achieve…

Replication!

If primary dies…there’s a backup!

7 of 50

General Flow

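What messages do they send/receive?

[Diagram: the nodes are the Client, the Primary, the Backup, a Pool of Idle Servers, and the View Server.]

  • Client → Primary: Request
  • Primary → Backup: Forward Request
  • Backup → Primary: Forward Ack
  • Primary → Client: Response
  • Primary, Backup, and idle servers → View Server: Ping
  • Client → View Server: GetView
  • View Server → each sender: ViewReply (VR)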

8 of 50

What’s a view? (in the labs)

9 of 50

The View Server

  • Determines the primary, backup, and view number
  • Goes through sequence of views (like a sequence number)
  • Responds to the servers and clients that contact it (Ping, GetView) with the actual current view.
    • A server’s or client’s view may trail behind.
  • Keeps track of who is alive/dead, transitioning to a new view if primary or backup dies.
  • Single point of failure :(

10 of 50

Receiving and Responding to Pings

Pinging the View Server:

  1. Confirmation that the server is alive
  2. The most recent view the server knows
    1. Primary needs to acknowledge the current view.

Replying to Pings:

View Server’s reply to a Ping informs the server of the current view

[Diagram: the Idle Servers, the Primary, and the Backup each send Ping(i) to the View Server, whose current view is i.]

11 of 50

When can the VS move on to a new view?

From start-up view (special case) with no primary/backup:

  • Can transition to initial view

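For any other view (general case), both of these must hold:

  • The current view has been acknowledged by its primary
  • There is a reason to change: a primary or backup has died, or there is no backup and an idle server is available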

12 of 50

When can the VS move on to a new view?

[Diagram: view transitions]

  • View STARTUP_VIEWNUM (first view): Primary: null, Backup: null
  • Server A pings the View Server → View INITIAL_VIEWNUM: Primary: A, Backup: null
  • Server B pings the View Server → View INITIAL_VIEWNUM + 1: Primary: A, Backup: B

The View Server cannot transition to INITIAL_VIEWNUM + 1 until the Primary (A) acknowledges View INITIAL_VIEWNUM.

Q: Can View1(primary=A, backup=null) change to View2(primary=B, backup=C) if A died?

13 of 50

How is this done in code?

  • More specifically, when does a View Server determine that a primary or backup has died in order to move on to a newer view?
    • Using a timer, we can periodically check which servers have ping’d within a given interval.
    • How? By updating and resetting some data structure(s)
    • Why do we need to keep track of idle servers? If the backup dies, we can promote an idle server to backup (but NEVER directly to primary)

14 of 50

PingCheckTimer Semantics

“If the ViewServer doesn't receive a Ping from a server since the last PingCheckTimer, it should consider the server to be dead”

[Timeline diagram with marks at 0ms, 100ms, and 200ms: two PingCheckTimer firings (the 2nd most recent and the most recent), with a ping from S1 and a ping from S2 marked along the timeline.]

  • At the first PingCheckTimer (the 2nd most recent), S1 and S2 are still considered alive.
  • At the second PingCheckTimer (the most recent), S1 is still alive, but S2 has not pinged since the last PingCheckTimer, so S2 is considered dead.

For Lab 2: do NOT store timestamps on the view server! It needs to be deterministic for the search tests.
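One way to follow the no-timestamps rule is to remember, in a plain set, which servers have pinged since the last PingCheckTimer. The sketch below is a minimal, framework-free illustration of that idea. `PingTrackerSketch` and its method names are hypothetical and not part of the lab framework; servers are identified by strings only to keep the example self-contained.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: liveness tracking by ping rounds, with no timestamps. */
final class PingTrackerSketch {
    // Servers that have pinged since the most recent PingCheckTimer.
    private final Set<String> pingedThisRound = new HashSet<>();
    // Servers currently considered alive.
    private final Set<String> alive = new HashSet<>();

    /** Call from the Ping handler. */
    void onPing(String server) {
        pingedThisRound.add(server);
        alive.add(server);
    }

    /** Call from the PingCheckTimer handler (the caller then re-sets the timer). */
    void onPingCheckTimer() {
        // Anyone who did not ping since the last timer is considered dead.
        alive.retainAll(pingedThisRound);
        pingedThisRound.clear();
    }

    boolean isAlive(String server) {
        return alive.contains(server);
    }
}
```

Because the state is just two sets, two view servers that saw the same sequence of pings and timers end up with equal state, which is what the search tests rely on.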

15 of 50

When can the VS move on to a new view?

  • The first view is STARTUP_VIEWNUM: {primary=null, backup=null}
  • The first ping from any server (say server A) moves the startup view to INITIAL_VIEWNUM: {primary=A, backup=null}
  • View INITIAL_VIEWNUM+1 should be {primary=A, backup=B} if a backup (server B) is available
  • The primary acknowledges the first non-null view (INITIAL_VIEWNUM) with its own ping
    • What if another server has pinged in the meantime?
      • It should be added as the backup once the primary acknowledges (in other words, the VS transitions to a new view)

16 of 50

View Transition Timeouts

  • Only move to a new view (i + 1) if the primary of view (i) has acknowledged view (i)!
  • What happens if the primary fails? (assume current view has been ack'd)
    • The backup becomes the new primary in View i+1; add another backup if an idle server is available
  • What happens if the backup fails?
    • A new server becomes the backup in View i+1, or the backup is null if no idle server is available
  • What happens if the primary fails with no backup?
    • Just do nothing and hope that it comes back
  • What happens if both primary and backup fail?
    • Cry :( -> Also do nothing
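The four cases above can be written as one pure function from (current view, acked?, live servers) to the next view. This is only a sketch under stated assumptions: `ViewSketch` and `ViewTransitionSketch` are hypothetical names, servers are plain strings, the startup view is modeled as a view with a null primary, and idle servers are picked in sorted order just to keep the result deterministic.

```java
import java.util.Objects;
import java.util.SortedSet;

/** Hypothetical immutable view: number, primary, backup (null = none). */
final class ViewSketch {
    final int viewNum;
    final String primary;
    final String backup;

    ViewSketch(int viewNum, String primary, String backup) {
        this.viewNum = viewNum;
        this.primary = primary;
        this.backup = backup;
    }
}

final class ViewTransitionSketch {
    /** First live server other than `exclude`, or null if there is none. */
    static String pickIdle(SortedSet<String> alive, String exclude) {
        for (String s : alive) {
            if (!s.equals(exclude)) return s;
        }
        return null;
    }

    /** Returns the next view, or `cur` itself if no transition is allowed or needed. */
    static ViewSketch next(ViewSketch cur, boolean acked, SortedSet<String> alive) {
        if (cur.primary == null) {
            // Startup view: there is no primary to wait on.
            String first = pickIdle(alive, null);
            return first == null ? cur : new ViewSketch(cur.viewNum + 1, first, null);
        }
        if (!acked) return cur; // never leave an unacknowledged view
        boolean primaryAlive = alive.contains(cur.primary);
        boolean backupAlive = cur.backup != null && alive.contains(cur.backup);
        if (!primaryAlive && !backupAlive) return cur; // nobody is eligible to be primary
        // Only the old primary or old backup may be primary in the next view.
        String primary = primaryAlive ? cur.primary : cur.backup;
        String backup = (primaryAlive && backupAlive) ? cur.backup : pickIdle(alive, primary);
        if (primary.equals(cur.primary) && Objects.equals(backup, cur.backup)) return cur;
        return new ViewSketch(cur.viewNum + 1, primary, backup);
    }
}
```

Note that an idle server is only ever chosen as the backup (or as the very first primary from the startup view), never promoted directly to primary of a later view.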

Worksheet Q2

18 of 50

Example Call Flow Diagram

(S1 crash)

[Call flow diagram between Server 1, Server 2, Server 3, and the View Server. Pings and the View Server’s replies, in approximate order:]

  • Ping(0) → View 1 {S1, null}
  • Ping(0) → View 1 {S1, null}
  • Ping(1) → View 2 {S1, S2}
  • Ping(1) → View 2 {S1, S2}
  • Ping(2) * → View 2 {S1, S2}
  • S1 crashes
  • Ping(2) → View 3 {S2, null}
  • Ping(0) from Server 3 → View 3 {S2, null}
  • Ping(3) → View 4 {S2, S3}

* S1 sends application state to S2 and gets an ack back before Ping(2), acknowledging the view

19 of 50

Call flow Diagram (worksheet q3)

  • At what point is View 1 acknowledged by S1?

View 1 is acknowledged at point e.

  • The view server needs to wait for the primary to ping back with the current view number before considering a view acknowledged.

Worksheet Q3

20 of 50

Example Call Flow Diagram

(S2 crash)

[Call flow diagram between Server 1, Server 2, Server 3, and the View Server. Pings and the View Server’s replies, in approximate order:]

  • Ping(0) → View 1 {S1, null}
  • Ping(0) → View 1 {S1, null}
  • Ping(1) → View 2 {S1, S2}
  • Ping(1) → View 2 {S1, S2}
  • Ping(2) * → View 2 {S1, S2}
  • S2 crashes
  • Ping(2) → View 3 {S1, null}
  • Ping(0) from Server 3 → View 3 {S1, null}
  • Ping(3) → View 4 {S1, S3}
  • Ping(3) → View 4 {S1, S3} **

* S1 sends application state to S2 and gets an ack back before Ping(2), acknowledging the view

** Same as * except with S3 instead of S2 and Ping(4) (which isn’t drawn)

21 of 50

Call flow Diagram (worksheet q4)

  • At what point does ViewServer move on to View 2? What will the primary and backup in View 2 be given the following call flow diagram?
  • Primary is S1, Backup is S3 in View 2.

Worksheet Q4

22 of 50

Primary and Backup

  • Only the Primary responds to the client.
    • What if a non-primary server gets a client request? → Ignore the request!
  • The primary should forward each request to the backup and receive an ack before executing it and responding to the client.
  • What needs to be done when primary has a new backup?
    • Transfer state to backup
      • TIP: Send entire AMOApplication in a new message type
    • Ignore any incoming requests until state transfer complete
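The bullets above amount to a small decision table for what a server does with an incoming client request. The sketch below spells it out. `RequestRoutingSketch` and the `Action` names are hypothetical, and the branch where a primary with no backup executes immediately is an assumption of this sketch rather than something the slides state.

```java
/** Hypothetical decision helper for a server that receives a client request. */
final class RequestRoutingSketch {
    enum Action { IGNORE, DROP_DURING_TRANSFER, FORWARD_TO_BACKUP, EXECUTE_AND_REPLY }

    static Action onClientRequest(String self, String primary, String backup,
                                  boolean stateTransferInProgress) {
        // Only the primary of the current view may serve clients.
        if (!self.equals(primary)) return Action.IGNORE;
        // Assumption: with no backup there is nobody to wait for.
        if (backup == null) return Action.EXECUTE_AND_REPLY;
        // Every operation must come before or after the state transfer, never during it.
        if (stateTransferInProgress) return Action.DROP_DURING_TRANSFER;
        // Otherwise the backup must ack before the primary executes and replies.
        return Action.FORWARD_TO_BACKUP;
    }
}
```

Dropping (rather than queueing) during a state transfer keeps the server state small; the client’s retry timer will resend the request.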

23 of 50

Analysis: Processing Multiple Requests Simultaneously

[Diagram: Client 1 sends Put(“a”, “foo”) and Client 2 sends Put(“a”, “bar”) to the Primary, which forwards both to the Backup. The two servers apply the writes in different orders, so one ends with “a” -> “bar” and the other ends with “a” -> “foo”.]

24 of 50

Remember the rules!

  1. Primary in view i+1 must have been backup or primary in view i
  2. ViewServer cannot move on to view i+1 until ack from primary in view i is received
  3. Primary must wait for backup to accept/execute each op before doing op and replying to client
  4. Backup must accept forwarded requests only if view is correct
  5. Non-primary must reject client requests
  6. Every operation must be before or after state transfer

25 of 50

Questions So Far?

26 of 50

Design Doc Tips

27 of 50

Why Design Docs?

  • Overwhelmingly positive feedback last quarter
  • Treat the design doc as the challenging part of the labs
  • Our distributed system is a state machine with 2 possible actions
    • A server receives a message
    • A server fires a timer
  • Design doc defines what we do for each of those actions
  • A comprehensive design doc will in theory catch all of our edge cases!

28 of 50

Things to Design

  • Preface & Conclusion help us set up the problem
    • What are the cases we need to handle? What can we ignore?
    • Fault model for this class is defined for us
  • Protocol defines how we achieve our goals
  • Correctness/Liveness Analysis helps convince us our design actually works

29 of 50

Good Design Doc Practices

  • You should be able to hand another student/TA your design doc and they should be convinced your design works
  • Keep our designs application & language agnostic
  • Explain not only the “what”, but also the “how”
  • Be concise and use bullet points!

30 of 50

Example Protocol for At least Once RPC (Lab 1 Part 2)

  • Kinds of nodes:
    • There are two kinds of nodes: clients and servers.
    • There can be any number of clients and any number of servers.
  • State at each kind of node:
    • Client:
      • Sequence number
        • What is it? Integer, sequence # of our current request
        • Starts at 0, increases by 1 when client sends a command
      • Current Request
        • What is it? Request, the current request we are working on
        • Starts as null, gets set when client sends a command
      • Last Reply
        • What is it? Reply, the reply to the current request we are working on
        • Starts as null, gets set to the reply when we hear from the server. Reset to null when we send a new command
    • Server:
      • No evolving state (Stores only an application)

31 of 50

Example Protocol for At least Once RPC (Lab 1 Part 2)

  • Messages:
    • Request Message
      • Source: Clients
      • Destination: Servers
      • Contents
        • Command to be executed
        • A sequence number
      • When is it sent?
        • Whenever a client wants to invoke a command
        • Client sets its current request to this message, its last reply to null, and then sends this message to the server
        • Client sets the client timer to resend the message
      • What happens at the destination when it is received?
        • The server passes the command to the server’s application and executes it
        • The server takes the result from the application, wraps it in a Reply message and sends it to the client that sent the Request message

32 of 50

Example Protocol for At least Once RPC (Lab 1 Part 2)

  • Messages:
    • Reply message
      • Source: Servers
      • Destination: Clients
      • Contents:
        • The reply from the application
        • A sequence number (integer)
      • When is it sent?
        • When a server receives a Request message, it executes the command in the request and then responds to the client with a Reply message
      • What happens at the destination when it is received?
        • The client checks if the Reply message corresponds to the Request it is currently working on.
        • If it matches, the client sets its Last Reply field to this reply

33 of 50

Example Protocol for At least Once RPC (Lab 1 Part 2)

  • Timers:
    • RequestRetransmit
      • Set by clients
      • Contents: a sequence number (integer)
      • Set whenever a client sends a new RPC
      • What happens when it fires?
        • The client checks if the timer's sequence number is the same as the sequence number of the current request on the client. If it isn’t, ignore the timer.
        • Otherwise, the client retransmits the current request, and resets the timer
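The client state and the RequestRetransmit handler described in this example protocol can be sketched as below. `RetransmitClientSketch` is a hypothetical, framework-free stand-in: requests and replies are plain strings, and actually sending messages or setting timers is left to the caller. The extra check that no reply has arrived yet is an addition of this sketch, so that a timer for an answered request is not re-set.

```java
/** Hypothetical client-side state for the at-least-once example protocol. */
final class RetransmitClientSketch {
    private int seqNum = 0;         // sequence number of the current request
    private String currentRequest;  // null until the first command is sent
    private String lastReply;       // null while the current request is unanswered

    /** Records a new command; the caller sends it and sets a timer carrying the returned number. */
    int sendCommand(String command) {
        seqNum++;
        currentRequest = command;
        lastReply = null;
        return seqNum;
    }

    /** Accepts a reply only if it matches the request we are currently working on. */
    void onReply(int replySeqNum, String reply) {
        if (currentRequest != null && replySeqNum == seqNum) {
            lastReply = reply;
        }
    }

    /** Returns true if the caller should resend the current request and reset the timer. */
    boolean onRetransmitTimer(int timerSeqNum) {
        // Stale timers (for an older request) are ignored.
        return timerSeqNum == seqNum && lastReply == null;
    }

    String getLastReply() {
        return lastReply;
    }
}
```

Keeping the sequence number inside the timer is what lets old timers die quietly instead of piling up.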

34 of 50

Where to Start

  • Messages
    • Request / Reply
    • Forward msg / forward msg Ack between P and B
    • State transfer / State transfer Ack
    • Others?
  • Timer handlers
    • Check state in every timer handler: only re-set the timer (call set()) if the response to the message it was set for has not been received yet
  • State to keep for PBClient/PBServer
    • AMOApplication (only server)
    • Sequence number (on client)
    • Current View

35 of 50

Lab 2 Debugging Tips

36 of 50

General tips

  • Order of operations for each command:
    • Primary forwards the command to the backup
    • Backup processes the command and sends a response to the primary
    • Primary processes the command, then responds to the client
  • What should the primary do if it gets a request for a new command before it receives an acknowledgement for the previous command from the backup?
    • Dropping requests is the simplest option
    • We recommend just processing one request at a time
  • Ensure primary and backup agree on the current view
    • Send the view with every message and check viewNum on each receive
  • Non-primary servers should reject any requests!
  • Simplify states as much as possible, and don’t use more timers than necessary
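A minimal way to enforce “one request at a time” on the primary is to remember the single command that is currently forwarded and to accept only the ack that matches it. In the sketch below, `PendingForwardSketch` is a hypothetical name, and commands are plain strings standing in for whatever uniquely identifies a request in your design.

```java
/** Hypothetical primary-side bookkeeping: at most one forwarded command in flight. */
final class PendingForwardSketch {
    private String pendingCommand; // null when nothing is in flight
    private int pendingViewNum;

    /** Returns true if the command was accepted for forwarding; false means drop it (the client retries). */
    boolean tryForward(int viewNum, String command) {
        if (pendingCommand != null) return false;
        pendingCommand = command;
        pendingViewNum = viewNum;
        return true;
    }

    /** Returns true only for an ack that matches the in-flight command and its view. */
    boolean onForwardAck(int ackViewNum, String ackedCommand) {
        if (pendingCommand == null) return false;
        if (ackViewNum != pendingViewNum || !pendingCommand.equals(ackedCommand)) return false;
        pendingCommand = null; // the primary may now execute the command and reply to the client
        return true;
    }
}
```

Duplicated or delayed acks simply return false here, so they cannot cause the primary to execute a command twice or out of order.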

37 of 50

State Transfer - Things to keep in mind

  • If primary receives a new view with a backup - need to do state transfer
  • Include all data (gets, puts, appends) and all RPC history
    • Sending entire AMOApplication in a message should be sufficient
  • Can’t process any requests while state transfer in progress (why?)
    • Primary should drop requests while state transfer in progress; client will retry
  • Backup can receive duplicated or late state transfer messages.
    • Ensure that state on backup is only overwritten once per view change
  • What happens if the ack gets dropped?
    • Usual retry logic applies
    • State transfer messages and acks can both be dropped
  • Pings from primary during state transfer should reflect old view.
    • So the primary only moves to new view once state transfer is complete.
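On the backup side, “overwrite only once per view change” and “the ack may be dropped” pull in slightly different directions: a duplicate must not overwrite state again, but it should still be acked. The sketch below shows one way to reconcile them. `BackupStateTransferSketch` is a hypothetical name, and a string stands in for the entire AMOApplication.

```java
/** Hypothetical backup-side handling of state transfer messages. */
final class BackupStateTransferSketch {
    private int installedViewNum = -1; // view whose state transfer has been applied
    private String appState = "";      // stand-in for the whole AMOApplication

    /** Returns true if the backup should send a state transfer ack. */
    boolean onStateTransfer(int msgViewNum, int currentViewNum, String state) {
        // Ignore transfers that are not for the view this backup is in.
        if (msgViewNum != currentViewNum) return false;
        // Install at most once per view; duplicates must not roll back newer state.
        if (installedViewNum != msgViewNum) {
            appState = state;
            installedViewNum = msgViewNum;
        }
        // Ack duplicates too, in case the earlier ack was dropped.
        return true;
    }

    /** Stand-in for executing a forwarded command after the transfer. */
    void applyForwarded(String op) {
        appState += op;
    }

    String getAppState() {
        return appState;
    }
}
```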

38 of 50

Additional Lab 2 hints/help

  • Follow tips on slides
    • Especially slides about processing multiple requests & processing requests during state transfer
  • Make sure that the view server can only move to a new view after the current primary acks the current view
  • Add timers only when necessary, i.e. if things need to be retried
  • You might want to run the test suite multiple times in case there are some infrequent errors
  • You might need to tune your timers, since it takes time to process messages, e.g. a 10ms timer is generally too short

39 of 50

Even more Lab 2 hints

  • Don't need 'curr view' and 'next view' in PBServer
    • in handleViewReply, we don't care what the node WAS, we only care about what it is becoming in the new view, so it’s unnecessary to store more than 1 view
  • Consider pinging with old view or not pinging at all during state transfer
  • On client: easiest to ask for current view whenever there is a timeout
  • Things that can increase the number of states to explore:
    • Every state transition: setting new timers, sending new messages
    • Unnecessary state information, e.g. a retry counter that just keeps incrementing

40 of 50

Some debugging tips

Invariant Violations

  • Run tests: -g FINER &> log.txt → might need to print out multiple runs if the issue does not appear every time (printlns also helpful)
  • Search tests: Visual Debugger → retrace the steps that led to the invariant violation

Liveness Violation or Timeout

  • Run tests: -g FINER &> log.txt → look for patterns in the log file (printlns also helpful)
  • Search tests: Visual Debugger → try to drive the system towards the goal (under the constraints of the test)

41 of 50

Some debugging tips

  • To debug a test that takes too long, run it with logging (-g FINER or -g FINEST) and write the output to a file using &>
    • Look for stretches of text where the system makes no progress, e.g. repeated timers/handlers and no replies
  • To add your own logging
    • import lombok.extern.java.Log
    • annotate class with @Log
    • LOG.info("Some message here") or LOG.fine("Some message here"), and use -g FINE or -g INFO to see your logs
      • Why log rather than println? Easy to turn off logging in search tests.
  • Write clean-ish code!
    • Commenting/refactoring your code can be pretty annoying but can also be super useful when you’re looking back and trying to debug code you wrote a week ago (or even a day ago in some bad cases lol)

42 of 50

Run Test Debugging: Logging

  • What to avoid: “Hello”, “HelloHi”, & “HelloHiBye” …
  • Idea is to glance at the log and get a general idea of system state
  • Make a logging function that systematically outputs the log with lots of information
    • Example: “<method_name>, <line_number>, <sender>, <receiver>, <your_custom_msg>”
    • Produces output: “OnHandleRequest, Line 143, Server1, Client3, Received requests from client 3”
    • Adding timestamps can be helpful too
  • Notice the “,”? You can save log as a “.csv” file and export it as a spreadsheet. (Alternatively separate by tabs and save as a .tsv)
    • Easy Searching
    • Filter rows based on constraints
    • Changing Layout
    • Conditional Row Coloring
    • … You get the idea :)
  • Bonus: Check out Java Logger Interface if you want to create custom loggers. Can do even more … like having custom logging levels and custom tags
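A systematic logging helper along these lines might look like the sketch below. It uses only `java.util.logging` so it runs on its own; `CsvLogSketch` and its methods are hypothetical names, and the line number is passed in by hand here purely for illustration. The free-text message is quoted so that a comma inside it does not split the column when the log is opened as a spreadsheet.

```java
import java.util.logging.Logger;

/** Hypothetical helper that formats every log entry as one CSV row. */
final class CsvLogSketch {
    private static final Logger LOG = Logger.getLogger(CsvLogSketch.class.getName());

    /** Builds "<method>,Line <n>,<sender>,<receiver>,"<message>"". */
    static String csvLine(String method, int line, String sender, String receiver, String msg) {
        // CSV convention: wrap the field in quotes and double any embedded quotes.
        String quoted = "\"" + msg.replace("\"", "\"\"") + "\"";
        return String.join(",", method, "Line " + line, sender, receiver, quoted);
    }

    /** Logs one row at FINE level. */
    static void log(String method, int line, String sender, String receiver, String msg) {
        LOG.fine(csvLine(method, line, sender, receiver, msg));
    }
}
```

For example, `CsvLogSketch.csvLine("OnHandleRequest", 143, "Server1", "Client3", "Received request")` yields a single row with five columns that a spreadsheet can filter and sort.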

43 of 50

Search Debugging: Invariant Violations

  • Don’t use logging (-g FINEST) or printlns for the search tests
    • The search tests do a BFS/DFS through search states, so you may see inconsistent messages
  • To debug invariant violations:
    • Open the trace in the debugger, read the invariant tooltip (if present) to see what went wrong.
    • Step through the trace to see how the invariant was violated
    • Determine if it’s an issue in your design or an inconsistency between your design and implementation
      • A common error is forgetting that the network can duplicate and reorder messages
    • In the debugger, you can look at the app state on the servers to see if something is amiss

44 of 50

Search Debugging: Liveness Violations

  • To debug liveness violations (“could not find state matching ...”):
    • Read the test (optional, but useful).
      • Find which search fails, and go to that particular bfs within the test.
      • Read the search settings, focus on the network settings.
    • In the debugger, read the goals and prunes of the test.
    • Figure out how you should be able to reach the goal within the restrictions of the search.
    • Take steps in the debugger to reach the goal.
      • If you can’t reach the goal: either need to fix the protocol (the idea of what should happen) or the implementation (the idea of what should happen doesn’t match what’s in your code).
      • If you can reach the goal, modify the code so that the search space is easier to explore.
        • make the search space smaller
        • make the trace shorter

45 of 50

Search Debugging: Liveness Violations

  • Make the search space smaller
    • Run the test with checks
      • Fix all equality issues (add @Data or replace types which don’t override equals) and non-determinism.
      • See if idempotency issues need to be fixed. Some non-idempotence is OK.
    • Ensure that retried messages are identical, and avoid retry counts
    • Confirm that timer queues don’t grow without bound (don’t set the same timer twice in the same handler).
    • For resend/discard timers, make sure that old timers are dropped.
  • Make the trace shorter
    • Common mistake: set a timer, wait for the timer to fire, then send some message.
      • Instead: send the message immediately, then set the timer
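To see why equality matters for the size of the search space, compare a message type that overrides equals/hashCode with one that does not. The sketch below writes the two methods by hand so it runs without Lombok; in the labs, annotating the class with @Data generates equivalent methods. Both class names are hypothetical.

```java
import java.util.Objects;

/** Hypothetical message with value equality: identical retries compare equal. */
final class ForwardMsgSketch {
    final int viewNum;
    final String command;

    ForwardMsgSketch(int viewNum, String command) {
        this.viewNum = viewNum;
        this.command = command;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ForwardMsgSketch)) return false;
        ForwardMsgSketch other = (ForwardMsgSketch) o;
        return viewNum == other.viewNum && Objects.equals(command, other.command);
    }

    @Override
    public int hashCode() {
        return Objects.hash(viewNum, command);
    }
}

/** Same fields, but inherits identity-based equals/hashCode from Object. */
final class NoEqualsMsgSketch {
    final int viewNum;
    final String command;

    NoEqualsMsgSketch(int viewNum, String command) {
        this.viewNum = viewNum;
        this.command = command;
    }
}
```

Putting two identical retries into a `HashSet` leaves one element for `ForwardMsgSketch` but two for `NoEqualsMsgSketch`. A retry counter inside the message has the same effect as missing equals: every retry becomes a distinct message, and every distinct message multiplies the states the search has to explore.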

46 of 50

Other common issues

  • Think about what should happen when you get duplicated/delayed messages:
    • State Transfers
    • State Transfer acks
    • Forward Requests
    • Forward Replies
    • View Replies
  • Just because you pass the ViewServer tests, it doesn’t mean it’s 100% right
    • You shouldn’t be going from (primary, null) -> (null, null)

47 of 50

Pro Tips

1. If you’re failing test 2.19 and only test 2.19 with the error below, try removing maps in the view server if you have any

Could not find state matching "All clients' workloads finished"

48 of 50

Some other tips for search tests

If you get: Could not find state matching "All clients' workloads finished"

  • This means you’re generating too many states
  • Might be due to not having a clone/hashCode/equals
    • Try running --checks
      • Fix any equals or hashCode errors
      • Sometimes it’s okay for things to not be idempotent. It’s better if they are, but sometimes it’s okay.
  • Possibly due to receiving multiple copies of a message producing multiple messages/timers
    • E.g. getting the same handleViewReply produces multiple state transfer timers/messages
  • Servers should ping the ViewServer to get a view

49 of 50

More Tips

If you get “not all clients’ results are the same”, e.g. GetResult(“x”, “xy”) vs. GetResult(“x”, “yx”):

  • Make sure that state transfers only happen once per view
  • Make sure that you only handle a single request at a time
    • Check each forward reply and make sure that it’s for the same command as the one you’re waiting for

50 of 50

If you’re timing out on test 2.14/2.15

Try putting print statements on view changes and where you handle results on the clients, as a starting point for debugging

You can reply right away to alreadyExecuted commands; you don’t need to forward them to the backup and wait for an ack
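The shortcut for already-executed commands can be sketched as a lookup the primary does before deciding to forward. `AmoCacheSketch` below is a hypothetical stand-in for the at-most-once bookkeeping in your own AMOApplication: clients are strings, results are strings, and only the most recent result per client is kept.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical at-most-once bookkeeping: last executed sequence number and result per client. */
final class AmoCacheSketch {
    private final Map<String, Integer> lastSeq = new HashMap<>();
    private final Map<String, String> lastResult = new HashMap<>();

    /** Records that the command with this sequence number was executed. */
    void record(String client, int seqNum, String result) {
        lastSeq.put(client, seqNum);
        lastResult.put(client, result);
    }

    boolean alreadyExecuted(String client, int seqNum) {
        Integer last = lastSeq.get(client);
        return last != null && seqNum <= last;
    }

    /**
     * Returns the cached result if the request is a retry of the client's latest
     * executed command, or null if there is nothing to send back from the cache.
     */
    String cachedReply(String client, int seqNum) {
        Integer last = lastSeq.get(client);
        if (last != null && seqNum == last) return lastResult.get(client);
        return null;
    }
}
```

With this in place, a primary would check `alreadyExecuted` first: if it is true, it answers from the cache (when a cached reply exists) and skips the forward entirely; only new commands go through the forward and ack round trip.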