1 of 50

Section 2: Lab 2

(Primary Backup)

CSE 452 Spring 2026

2 of 50

Announcements

Pset 1 & Lab 1 code, Design Doc due yesterday (04/08) [done individually]
Review syllabus for late day policy

48 hr grace period for lab 1
48 hr grace period for Pset 1
NO grace period for design doc

Lab 2: Work on only one partner’s repo and share it with the other partner

Design doc due next Friday (04/17), no grace period

Pset 2 also due next Friday

3 of 50

If you haven’t already...

Add your partner to your GitLab repo by following these steps:

Go to the designated GitLab repo that y’all decided to work on
Go to Members
Look up your partner’s NETID
Set their role to Maintainer/Developer
Click “Invite”

4 of 50

Recap 6 Rules:

Primary must wait for backup to accept/execute each op before doing op and replying to client
Backup must accept forwarded requests only if view is correct
Primary in view i+1 must have been backup or primary in view i
Non-primary must reject client requests
ViewServer cannot move on to view i+1 until ack from primary in view i is received
Every operation must be before or after state transfer (Problem with processing requests during a state transfer)

5 of 50

Why Primary-Backup?

6 of 50

Want to achieve…

Replication!

If primary dies…there’s a backup!

7 of 50

General Flow

What messages do they send/receive?

[Diagram] Nodes: Client, Primary, Backup, View Server, Pool of Idle Servers
[Diagram] Messages: Request, Forward Request, Forward Ack, Response, Ping, GetView, ViewReply (VR)

8 of 50

What’s a view? (in the labs)

9 of 50

The View Server

Determines the primary, backup, and view number
Goes through sequence of views (like a sequence number)
Responds to servers and clients that contact the VS (Ping or GetView) so they can learn the current view.

A server’s or client’s view may trail behind.

Keeps track of who is alive/dead, transitioning to a new view if primary or backup dies.
Single point of failure :(

10 of 50

Receiving and Responding to Pings

Pinging the View Server:

Confirmation that the server is alive
The most recent view the server knows

Primary needs to acknowledge the current view.

Replying to Pings:

View Server’s reply to a Ping informs the server of the current view

[Diagram: Idle Servers, Primary, and Backup send Ping(i) to the View Server; current view: i]

11 of 50

When can the VS move on to a new view?

From start-up view (special case) with no primary/backup:

Can transition to initial view

For any other view (general case):

When the current view has been acknowledged by its primary, and
a primary or backup dies (or there is no backup and an idle server is available)

12 of 50

When can the VS move on to a new view?

First view:

View: STARTUP_VIEWNUM, Primary: null, Backup: null

↓ Server A Ping’d View Server

View: INITIAL_VIEWNUM, Primary: A, Backup: null

↓ Server B Ping’d View Server

View: INITIAL_VIEWNUM + 1, Primary: A, Backup: B

View Server cannot transition to INITIAL_VIEWNUM + 1 until Primary (A) acknowledges View INITIAL_VIEWNUM

Q: Can View1(primary=A, backup=null) change to View2(primary=B, backup=C) if A died?

13 of 50

How is this done in code?

More specifically, when does a View Server determine that a primary or backup has died in order to move on to a newer view?

Using some sort of timer, we can periodically check which servers have ping’d within a given interval.
How? Updating and resetting some data structure(s)
Why do we need to keep track of other idle servers? If the backup dies, we can promote an idle server to backup (but NEVER directly to primary)

14 of 50

PingCheckTimer Semantics

[Timeline figure from 0ms to 200ms: PingCheckTimer firings (the 2nd most recent and the most recent PingCheckTimer) with pings from S1 and S2 marked]

At the first PingCheckTimer, S1 and S2 are still considered alive.

At the second PingCheckTimer, S1 is still alive, but since S2 has not pinged since the last PingCheckTimer, S2 is considered dead.

“If the ViewServer doesn't receive a Ping from a server since the last PingCheckTimer, it should consider the server to be dead”

For lab 2: Do NOT store timestamps on the view server! Needs to be deterministic for search tests.
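The PingCheckTimer rule can be tracked without any timestamps. Below is a minimal sketch, assuming a plain set of server names is enough; `PingTracker`, `onPing`, and `onPingCheckTimer` are hypothetical names for illustration, not part of the lab framework.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: tracks liveness by "pinged since last check", not by wall-clock time. */
class PingTracker {
    // Servers that have pinged since the most recent PingCheckTimer.
    private final Set<String> pingedSinceLastCheck = new HashSet<>();

    /** Called whenever a Ping arrives from a server. */
    void onPing(String server) {
        pingedSinceLastCheck.add(server);
    }

    /**
     * Called when PingCheckTimer fires. Returns the servers considered alive
     * (those that pinged since the previous firing) and resets the window.
     */
    Set<String> onPingCheckTimer() {
        Set<String> alive = new HashSet<>(pingedSinceLastCheck);
        pingedSinceLastCheck.clear();
        return alive;
    }
}
```

Because the only state is a set that is reset on each timer firing, two runs that deliver the same messages and timers in the same order end up in the same state.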

15 of 50

When can the VS move on to a new view?

First view is (STARTUP_VIEWNUM): {null, null}
The first ping of some server (server A) should result in transition of startup view to INITIAL_VIEWNUM (should be {primary=A, null})
View INITIAL_VIEWNUM+1 should be {primary=A, backup=B} if there is a backup (server B) available
Primary acknowledges first non-null view (INITIAL_VIEWNUM) with its own ping

What if a server has pinged since?

Should be added as backup when the primary acknowledges (In other words, transition to a new view)

16 of 50

View Transition Timeouts (worksheet q2)

Only move to a new view (i + 1) if the primary of view (i) has acknowledged view (i)!
What happens if the primary fails? (assume current view has been ack'd)

What happens if backup fails?

What happens if the primary fails with no backup?

What happens if both primary and backup fail?

Worksheet Q2

17 of 50

View Transition Timeouts

Only move to a new view (i + 1) if the primary of view (i) has acknowledged view (i)!
What happens if the primary fails? (assume current view has been ack'd)

Backup becomes the new primary in View i+1; pick a new backup from the idle servers if one is available

What happens if backup fails?

An idle server becomes the backup in View i+1 if one is available; otherwise the backup is null

What happens if the primary fails with no backup?

Just do nothing - hope that it comes back

What happens if both primary and backup fail?

Cry :( -> Also do nothing

Worksheet Q2
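The four cases above fit in one pure function. This is a sketch under stated assumptions, not the reference solution: `ViewSketch`, `ViewTransitions.next`, the `acked` flag, and the ordered `alive` list are all hypothetical names introduced here.

```java
import java.util.List;
import java.util.Objects;

/** Hypothetical immutable view: a view number plus primary and backup (either may be null). */
final class ViewSketch {
    final int viewNum;
    final String primary;
    final String backup;

    ViewSketch(int viewNum, String primary, String backup) {
        this.viewNum = viewNum;
        this.primary = primary;
        this.backup = backup;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ViewSketch)) return false;
        ViewSketch v = (ViewSketch) o;
        return viewNum == v.viewNum
            && Objects.equals(primary, v.primary)
            && Objects.equals(backup, v.backup);
    }

    @Override
    public int hashCode() {
        return Objects.hash(viewNum, primary, backup);
    }
}

final class ViewTransitions {
    /** Returns the next view, or cur itself when no transition is allowed or needed. */
    static ViewSketch next(ViewSketch cur, boolean acked, List<String> alive) {
        if (cur.primary == null) {
            // Startup view (special case): the first live server becomes primary.
            return alive.isEmpty() ? cur : new ViewSketch(cur.viewNum + 1, alive.get(0), null);
        }
        if (!acked) return cur; // primary of view i must ack view i first

        boolean primaryAlive = alive.contains(cur.primary);
        boolean backupAlive = cur.backup != null && alive.contains(cur.backup);
        if (!primaryAlive && !backupAlive) return cur; // nobody safe to promote: do nothing

        String newPrimary = primaryAlive ? cur.primary : cur.backup;
        String newBackup = (primaryAlive && backupAlive) ? cur.backup : null;
        if (newBackup == null) {
            // Promote an idle server to backup, never directly to primary.
            for (String s : alive) {
                if (!s.equals(newPrimary)) { newBackup = s; break; }
            }
        }
        if (newPrimary.equals(cur.primary) && Objects.equals(newBackup, cur.backup)) return cur;
        return new ViewSketch(cur.viewNum + 1, newPrimary, newBackup);
    }
}
```

Note that when the primary dies with no backup, the function returns the current view unchanged, even if idle servers are alive.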

18 of 50

Example Call Flow Diagram

(S1 crash)

Server 1

View Server

Server 2

Ping(0)

View 1 {S1, null}

Ping(0)

View 1 {S1, null}

Ping(1)

View 2 {S1, S2}

Ping(1)

View 2 {S1, S2}

Ping(2)

View 2 {S1, S2}

S1 crashes

Ping(2)

View 3 {S2, null}

Server 3

Ping(0)

Ping(3)

View 4 {S2, S3}

View 3 {S2, null}

*

* S1 sends application state to S2 and gets an ack back before Ping(2), acknowledging the view

19 of 50

Call flow Diagram (worksheet q3)

At what point is View 1 acknowledged by S1?

It is acknowledged at point e.

The View Server needs to wait for the primary to ping back with the current view number before considering the view acknowledged.

Worksheet Q3

20 of 50

Example Call Flow Diagram

(S2 crash)

Server 1

View Server

Server 2

Ping(0)

View 1 {S1, null}

Ping(0)

View 1 {S1, null}

Ping(1)

View 2 {S1, S2}

Ping(1)

View 2 {S1, S2}

Ping(2)

View 2 {S1, S2}

Server 3

Ping(0)

View 4 {S1, S3}

*

* S1 sends application state to S2 and gets an ack back before Ping(2), acknowledging the view

S2 crashes

Ping(2)

View 3 {S1, null}

Ping(3)

View 4 {S1, S3}

**

** Same as * except with S3 instead of S2 and Ping(4) (which isn’t drawn)

View 3 {S1, null}

Ping(3)

21 of 50

Call flow Diagram (worksheet q4)

At what point does ViewServer move on to View 2? What will the primary and backup in View 2 be given the following call flow diagram?

Primary is S1, Backup is S3 in View 2.

Worksheet Q4

22 of 50

Primary and Backup

Only the Primary responds to the client.

What if a non-primary server gets a client request? → Ignore the request!

Primary should pass requests to the backup and receive an ack before executing and responding to the client.
What needs to be done when primary has a new backup?

Transfer state to backup

TIP: Send entire AMOApplication in a new message type

Ignore any incoming requests until state transfer complete

23 of 50

Analysis: Processing Multiple Requests Simultaneously

Primary

Backup

Put(“a”, “foo”)

“a” -> “foo”

Client 1

Put(“a”, “foo”)

Client 2

Put(“a”, “bar”)

“a” -> “bar”

“a” -> “foo”

24 of 50

Remember the rules!

Primary in view i+1 must have been backup or primary in view i
ViewServer cannot move on to view i+1 until ack from primary in view i is received
Primary must wait for backup to accept/execute each op before doing op and replying to client
Backup must accept forwarded requests only if view is correct
Non-primary must reject client requests
Every operation must be before or after state transfer

25 of 50

Questions So Far?

26 of 50

Design Doc Tips

27 of 50

Why Design Docs?

Overwhelmingly positive feedback last quarter
Treat the design doc as the challenging part of the labs
Our distributed system is a state machine with 2 possible actions

A server receives a message
A server fires a timer

Design doc defines what we do for each of those actions
A comprehensive design doc will in theory catch all of our edge cases!

28 of 50

Things to Design

Preface & Conclusion help us set up the problem

What are the cases we need to handle? What can we ignore?
Fault model for this class is defined for us

Protocol defines how we achieve our goals
Correctness/Liveness Analysis helps convince us our design actually works

29 of 50

Good Design Doc Practices

You should be able to hand another student/TA your design doc and they should be convinced your design works
Keep our designs application & language agnostic
Explain not only the “what”, but also the “how”
Be concise and use bullet points!

30 of 50

Example Protocol for At-Least-Once RPC (Lab 1 Part 2)

Kinds of nodes:

There are two kinds of nodes: clients and servers.
There can be any number of clients and any number of servers.

State at each kind of node:

Client:

Sequence number

What is it? Integer, sequence # of our current request
Starts at 0, increases by 1 when client sends a command

Current Request

What is it? Request, the current request we are working on
Starts as null, gets set when client sends a command

Last Reply

What is it? Reply, the reply to the current request we are working on
Starts as null, gets set to the reply when we hear from the server. Reset to null when we send a new command

Server:

No evolving state (Stores only an application)

31 of 50

Example Protocol for At-Least-Once RPC (Lab 1 Part 2)

Messages:

Request Message

Source: Clients
Destination: Servers
Contents

Command to be executed
A sequence number

When is it sent?

Whenever a client wants to invoke a command
Client sets its current request to this message, its last reply to null, and then sends this message to the server
Client sets the client timer to resend the message

What happens at the destination when it is received?

The server passes the command to the server’s application and executes it
The server takes the result from the application, wraps it in a Reply message and sends it to the client that sent the Request message

32 of 50

Example Protocol for At-Least-Once RPC (Lab 1 Part 2)

Messages:

Reply message

Source: Servers
Destination: Clients
Contents:

The reply from the application
A sequence number (integer)

When is it sent?

When a server receives a request message, it executes the command in the request and then responds to the client with a Reply message

What happens at the destination when it is received?

The client checks if the Reply message corresponds to the Request it is currently working on.
If it matches, the client sets its Last Reply field to this reply

33 of 50

Example Protocol for At-Least-Once RPC (Lab 1 Part 2)

Timers:

RequestRetransmit

Set by clients
Contents: a sequence number (integer)
Set whenever a client sends a new RPC
What happens when it fires?

The client checks if the timer's sequence number is the same as the sequence number of the current request on the client. If it isn’t, ignore the timer.
Otherwise, the client retransmits the current request, and resets the timer
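The client state and RequestRetransmit timer described in this example protocol can be sketched in a few lines. This is a self-contained simulation, not lab code: `RetryClientSketch` is a hypothetical class, and the `sent` and `timers` lists are stand-ins for the network and the timer queue. It also skips retransmission once a reply has arrived, which is an extra check beyond what the timer slide lists.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical client: "sent" and "timers" stand in for the network and the timer queue. */
class RetryClientSketch {
    int sequenceNum = 0;            // starts at 0, +1 per new command
    String currentRequest = null;   // null until the client sends a command
    String lastReply = null;        // reset to null on each new command
    final List<String> sent = new ArrayList<>();
    final List<Integer> timers = new ArrayList<>();

    /** Client wants to invoke a command: record it, send it, set the retransmit timer. */
    void sendCommand(String command) {
        sequenceNum++;
        currentRequest = command;
        lastReply = null;
        sent.add(sequenceNum + ":" + command);
        timers.add(sequenceNum);
    }

    /** Accept a reply only if it matches the request we are currently working on. */
    void handleReply(int replySeq, String result) {
        if (currentRequest != null && replySeq == sequenceNum && lastReply == null) {
            lastReply = result;
        }
    }

    /** RequestRetransmit fired: ignore stale timers, otherwise resend and reset the timer. */
    void onRequestRetransmit(int timerSeq) {
        if (timerSeq != sequenceNum) return;  // timer belongs to an older request
        if (lastReply != null) return;        // already answered: do not re-arm
        sent.add(sequenceNum + ":" + currentRequest);
        timers.add(sequenceNum);
    }
}
```

The retransmitted message is identical to the original, which keeps duplicates easy to recognize on the server side.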

34 of 50

Where to Start

Messages

Request / Reply
Forward msg / forward msg Ack between P and B
State transfer / State transfer Ack
Others?

Timer handlers

Do proper checks in each timer handler; avoid calling set() again if the response to the message that the timer was set for has already been received

State to keep in PBClient/PBServer

AMOApplication (only server)
Sequence number (on client)
Current View

35 of 50

Lab 2 Debugging Tips

36 of 50

General tips

Order of operations for each command: primary forwards the command to the backup; backup processes the command and sends a response to the primary; primary processes the command; then the primary responds to the client

What should primary do if it gets a request for a new command before it receives an acknowledgement for the previous command from the backup?

Dropping requests is simplest option
We recommend just processing one request at a time

Ensure primary and backup agree on current view

Send view with every message and check viewNum on each receive
Non primary should reject any requests!
Simplify states as much as possible, and don’t use more timers than necessary
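Here is one way to picture "one request at a time" together with "check viewNum on each receive". It is a minimal sketch, assuming the simplest option of dropping requests while one is in flight; `PrimarySketch`, the `outbox` list, and the string message format are made up for illustration, and a plain map stands in for the AMOApplication.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical primary that forwards one Put at a time and waits for the backup's ack. */
class PrimarySketch {
    final int viewNum;
    final Map<String, String> store = new HashMap<>();   // stand-in for the application
    final List<String> outbox = new ArrayList<>();       // stand-in for sent messages
    String pendingKey = null;                             // the single in-flight command
    String pendingValue = null;

    PrimarySketch(int viewNum) {
        this.viewNum = viewNum;
    }

    /** Client request: forward to backup first; drop it if another command is in flight. */
    void handleClientPut(String key, String value) {
        if (pendingKey != null) return;  // simplest option: drop, the client will retry
        pendingKey = key;
        pendingValue = value;
        outbox.add("Forward(view=" + viewNum + "," + key + "=" + value + ")");
    }

    /** Backup's ack: execute and reply only if the view and the command both match. */
    void handleForwardAck(int ackViewNum, String key, String value) {
        if (ackViewNum != viewNum) return;                                  // wrong view
        if (!key.equals(pendingKey) || !value.equals(pendingValue)) return; // not the one we wait for
        store.put(key, value);
        outbox.add("Reply(" + key + "=" + value + ")");
        pendingKey = null;
        pendingValue = null;
    }
}
```

Since the primary executes only after a matching ack, a late or duplicated ack for some other command cannot cause the primary and backup to apply operations in different orders.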

37 of 50

State Transfer - Things to keep in mind

If primary receives a new view with a backup - need to do state transfer
Include all data (gets, puts, appends) and all RPC history

Sending entire AMOApplication in a message should be sufficient

Can’t process any requests while state transfer in progress (why?)

Primary should drop requests while state transfer in progress; client will retry

Backup can receive duplicated or late state transfer messages.

Ensure that state on backup is only overwritten once per view change
What happens if the ack gets dropped?

Usual retry logic applies

State transfer messages and acks can be dropped

Pings from primary during state transfer should reflect old view.

So the primary only moves to new view once state transfer is complete.
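The "only overwritten once per view change" rule could look like the sketch below on the backup side. The names are assumptions for illustration (`BackupSketch`, `lastTransferViewNum`), a map stands in for the transferred AMOApplication, and re-acking duplicates is one possible way to cope with a dropped ack.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical backup that installs transferred state at most once per view. */
class BackupSketch {
    int viewNum;                          // the view this server currently believes in
    int lastTransferViewNum = -1;         // view of the last state transfer we installed
    Map<String, String> state = new HashMap<>();

    /** Returns true if a state transfer ack should be sent back to the primary. */
    boolean handleStateTransfer(int msgViewNum, Map<String, String> snapshot) {
        if (msgViewNum != viewNum) return false;   // late message from another view: ignore
        if (lastTransferViewNum != msgViewNum) {
            state = new HashMap<>(snapshot);       // first transfer for this view: install it
            lastTransferViewNum = msgViewNum;
        }
        // Duplicate for the same view: keep current state, but ack again
        // in case the earlier ack was dropped.
        return true;
    }
}
```

Without the once-per-view check, a delayed duplicate could overwrite operations that the backup has already applied after the first transfer.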

38 of 50

Additional Lab 2 hints/help

Follow tips on slides

Especially slides about processing multiple requests & processing requests during state transfer

Make sure that the view server can only move to a new view after the current primary acks the current view
Add timers only when necessary, i.e. if things need to be retried
You might want to run the test suite multiple times in case there are some infrequent errors
You might need to tune your timers, since it takes time to process messages, e.g. a 10ms timer is generally unreasonably short

39 of 50

Even more Lab 2 hints

Don't need 'curr view' and 'next view' in PBServer

in handleViewReply, we don't care what the node WAS, we only care about what it is becoming in the new view, so it’s unnecessary to store more than 1 view

Consider pinging with old view or not pinging at all during state transfer
On client: easiest to ask for current view whenever there is a timeout
Things that can increase the number of states to explore:

Every state transition: setting new timers, sending new messages
Unnecessary state information, e.g. a retry counter that just keeps incrementing

40 of 50

Some debugging tips

Invariant Violations

Run tests: -g FINER &> log.txt → might need to print out multiple runs if the issue does not appear every time (printlns also helpful)
Search tests: Visual Debugger → retrace the steps that led to the invariant violation

Liveness Violation or Timeout

Run tests: -g FINER &> log.txt → look for patterns in the log file (printlns also helpful)
Search tests: Visual Debugger → try to drive the system towards the goal (under the constraints of the test)

41 of 50

Some debugging tips

To debug things taking too long, you can run it with logging (-g FINER or -g FINEST) and write it to a file using &>

Look for stretches of text where the system makes no progress, e.g. repeated timers/handlers and no replies

To add your own logging

import lombok.extern.java.Log
annotate class with @Log
LOG.info(“Some message here”) or LOG.fine(“Some message here”), and use -g FINE or -g INFO to see your logs

Why log rather than println? Easy to turn off logging in search tests.

Write clean-ish code!

Commenting/refactoring your code can be pretty annoying but can also be super useful when you’re looking back and trying to debug code you wrote a week ago (or even a day ago in some bad cases lol)

42 of 50

Run Test Debugging: Logging

What to avoid: “Hello”, “HelloHi”, & “HelloHiBye” …
Idea is to glance at the log and get a general idea of system state
Make a logging function that systematically outputs the log with lots of information

Example: “<method_name>, <line_number>, <sender>, <receiver>, <your_custom_msg>”
Produces output: “OnHandleRequest, Line 143, Server1, Client3, Received requests from client 3”
Adding timestamps can be helpful too

Notice the “,”? You can save log as a “.csv” file and export it as a spreadsheet. (Alternatively separate by tabs and save as a .tsv)

Easy Searching
Filter rows based on constraints
Changing Layout
Conditional Row Coloring
… You get the idea :)

Bonus: Check out Java’s logging API (java.util.logging.Logger) if you want to create custom loggers. You can do even more, like having custom logging levels and custom tags
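A logging function in the comma-separated format above might look like this. It is a rough sketch: `CsvLog`, `format`, `row`, and `escape` are hypothetical helpers, and reading the caller from stack frame index 2 is an assumption that holds on typical JVMs but is not guaranteed.

```java
/** Hypothetical helper that builds "<method>,<line>,<sender>,<receiver>,<msg>" rows. */
class CsvLog {
    /** Quotes a field if it contains a comma, quote, or newline (standard CSV escaping). */
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    /** Pure formatter: useful when the method name and line number are already known. */
    static String format(String method, int line, String sender, String receiver, String msg) {
        return String.join(",",
            escape(method), "Line " + line, escape(sender), escape(receiver), escape(msg));
    }

    /** Looks up the calling method and line from the stack, then formats the row. */
    static String row(String sender, String receiver, String msg) {
        StackTraceElement[] stack = Thread.currentThread().getStackTrace();
        // Assumption: [0] = getStackTrace, [1] = row, [2] = the caller of row.
        if (stack.length > 2) {
            return format(stack[2].getMethodName(), stack[2].getLineNumber(), sender, receiver, msg);
        }
        return format("unknown", -1, sender, receiver, msg);
    }
}
```

The resulting string can be handed to whatever logger you already use, e.g. as the argument to LOG.fine(...), so the rows can still be switched off by log level.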

43 of 50

Search Debugging: Invariant Violations

Don’t use logging (-g FINEST) or printlns for the search tests

The search tests do a BFS/DFS through search states, so you may see inconsistent messages

To debug invariant violations:

Open the trace in the debugger, read the invariant tooltip (if present) to see what went wrong.
Step through the trace to see how the invariant was violated
Determine if it’s an issue in your design or an inconsistency between your design and implementation

A common error is forgetting that the network can duplicate and reorder messages

In the debugger, you can look at the app state on the servers to see if something is amiss

44 of 50

Search Debugging: Liveness Violations

To debug liveness violations (“could not find state matching ...”):

Read the test (optional, but useful).

Find which search fails, and go to that particular bfs within the test.
Read the search settings, focus on the network settings.

In the debugger, read the goals and prunes of the test.
Figure out how you should be able to reach the goal within the restrictions of the search.
Take steps in the debugger to reach the goal.

If you can’t reach the goal: either need to fix the protocol (the idea of what should happen) or the implementation (the idea of what should happen doesn’t match what’s in your code).
If you can reach the goal, modify the code so that the search space is easier to explore.

make the search space smaller
make the trace shorter

45 of 50

Search Debugging: Liveness Violations

Make the search space smaller

Run the test with --checks

Fix all equality issues (add @Data or replace types which don’t override equals) and non-determinism.
See if idempotency issues need to be fixed. Some non-idempotence is OK.

Ensure that retried messages are identical, and avoid retry counts
Confirm that timer queues don’t grow without bound (don’t set the same timer twice in the same handler).
For resend/discard timers, make sure that old timers are dropped.

Make the trace shorter

Common mistake: set a timer, wait for the timer to fire, then send some message.

Instead: send the message immediately, then set the timer
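To see why equality issues matter for the size of the search space, consider how any set-based deduplication treats two objects with identical fields. The sketch below uses a plain HashSet as a stand-in (how the search actually deduplicates states is an assumption here), and both message classes are hypothetical. The hand-written equals/hashCode is roughly what Lombok's @Data would generate.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical message type that inherits Object's identity-based equals/hashCode. */
class PingNoEquals {
    final int viewNum;

    PingNoEquals(int viewNum) {
        this.viewNum = viewNum;
    }
}

/** Same field, but with value-based equals/hashCode. */
class PingWithEquals {
    final int viewNum;

    PingWithEquals(int viewNum) {
        this.viewNum = viewNum;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof PingWithEquals && ((PingWithEquals) o).viewNum == viewNum;
    }

    @Override
    public int hashCode() {
        return Objects.hash(viewNum);
    }
}

class EqualityDemo {
    /** Counts how many distinct entries a HashSet sees for the two objects. */
    static int distinctCount(Object a, Object b) {
        Set<Object> seen = new HashSet<>();
        seen.add(a);
        seen.add(b);
        return seen.size();
    }
}
```

Two PingNoEquals objects with the same viewNum count as two distinct entries, while two PingWithEquals objects collapse into one. The same effect applies to retry counters: a field that keeps changing makes otherwise identical messages look different.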

46 of 50

Other common issues

Think about what should happen when you get duplicated/delayed messages:

State Transfers
State Transfer acks
Forward Requests
Forward Replies
View Replies

Just because you pass the ViewServer tests, it doesn’t mean it’s 100% right

You shouldn’t be going from (primary, null) -> (null, null)

47 of 50

Pro Tips

1. If you’re failing test 2.19 and only test 2.19, try removing maps in the view server if you have any

Could not find state matching "All clients' workloads finished"

48 of 50

Some other tips for search tests

Could not find state matching "All clients' workloads finished"

This means you’re generating too many states
Might be due to not having a clone/hashCode/equals

Try running --checks

Fix any equals or hashCode errors
Sometimes it’s okay for things to not be idempotent. It’s better if they are, but sometimes it’s okay.

Possibly due to receiving multiple copies of a message producing multiple messages/timers

E.g. getting the same handleViewReply produces multiple state transfer timers/messages

Servers should ping the ViewServer to get a view

49 of 50

More Tips

If you get “not all clients’ results are the same”

Make sure that state transfers only happen once per view
Make sure that you only handle a single request at a time

You should check each forward reply and make sure that it’s for the same command as the one you’re waiting for

GetResult(“x”, “xy”)

GetResult(“x”, “yx”)

50 of 50

If you’re timing out on test 2.14/2.15

Try putting print statements on view changes and when you handle results on the clients as a point of departure for debugging

You can reply right away to alreadyExecuted commands, so you don’t need to forward them to the backup and wait for an ack