1 of 78

Lab 4 Parts 1 & 2

CSE 452 Spring 2026

2 of 78

Announcements

  • Pset 5, Lab 3 due tomorrow (5/15)
  • Lab 4 part 2 design doc due next Friday (5/22)
  • Review syllabus for late day policy
    • 48 hr grace period for lab
    • 48 hr grace period for Pset
    • NO grace period for design doc

3 of 78

Lab 3 Wrap up

  • Lab 4 relies heavily on lab 3 (Paxos)
  • However, even if your lab 3 implementation isn’t perfect, you can still work on lab 4
    • But definitely work on finishing up lab 3 first! Especially any correctness bugs.
    • Some liveness issues may not trigger problems in lab 4.
    • ShardMaster (lab 4.1) doesn’t depend on Paxos
    • Some 4.2/4.3 tests use Paxos groups of size 1, so you can still pass those without a perfect Paxos implementation
      • Make sure you are passing test 27 from lab3
    • You can also debug the rest of the 4.2/4.3 tests by changing them to use Paxos groups of size 1, to test the actual lab 4 logic separately from the Paxos logic

4 of 78

Lab 4 Overview

Goal: Build a “linearizable, sharded key-value store with multi-key updates and dynamic load balancing, similar in functionality to Amazon's DynamoDB or Google's Spanner”.

Dynamic load balancing (with a shard controller) is Part 1

Sharding is Part 2 (and extending Part 1)

Multi-key updates (transactions and two-phase commit) is Part 3

5 of 78

Now, how do we do all of that?

  • Paxos increases reliability, sharding increases performance and scalability
  • Part 1 implements the ShardMaster application (handles load balancing)
  • Part 2 implements a Sharded KV store and handles moving shards
  • Part 3 adds multi-key transaction support
    • This means a single request can update multiple keys in different shards, while maintaining linearizability
    • Implemented using two-phase commit

6 of 78

Lab 4 Part 1 - ShardMaster

7 of 78

Sharding: what is it?

  • Divides keyspace (the K in K/V) into multiple groups, called shards
    • Can shard keys on many things (alphabetically, random/hashes, load-balanced etc.)
  • Each shard will be handled by a group of servers. Each group:
    • Runs Paxos from lab 3. So we can assume a group will not fail :)
    • Stores all key/value pairs in the database that correspond to its shard
    • Accepts/responds to client requests that correspond to its shard
  • Since different sharding groups can run in parallel without communicating, performance is increased proportional to the number of shards
  • Lab 4: The assignment of keys to shards is fixed; the lab asks you to do load balancing by assigning shards to groups of nodes.

8 of 78

Terminology

  • Shard Master
    • “Application” replicated by Paxos
    • Service that responds to changes in configuration (new Paxos groups being added, removed, etc)
  • Configuration:
    • Similar to a view in primary/backup lab 2
    • Specifies which groups are responsible for which shards
    • Has configuration number (monotonically increasing)
  • Paxos Replica Group
    • Group of servers performing Paxos agreement with each other - just like Lab 3
    • Handles key/value storage for assigned shards
  • Shard
    • In charge of a subset of key/value pairs,
      • e.g. shard that stores all keys starting with “a” or that stores all keys that start from “a-g”
    • Shards are numbered 1....numShards

9 of 78

Lab 4

Sharding/partitioning

Peer 0

Peer 1

Peer 2

Paxos Group 0

Peer 0

Peer 1

Peer 2

Paxos Group 1

Peer 0

Peer 1

Peer 2

Paxos Group 2

Put(A, 0)

Get(B)

Append(C, 123)

10 of 78

Shards vs Groups

Peer 0

Peer 1

Peer 2

Paxos Group 0

Peer 0

Peer 1

Peer 2

Paxos Group 1

Peer 0

Peer 1

Peer 2

Paxos Group 2

Shard 1

(a-d)

Shard 2

(e-h)

Shard 3

(i-l)

Shard 4

(m-p)

Shard 5

(q-t)

Shard 6

(u-x)

Shard 7

(y-z)

Each group of servers is in charge of shardNum(s)

11 of 78

ShardMaster

  • A service to keep track of which groups serve which shards
  • Necessary because:
    • Clients need to be able to figure out what group to send requests to (i.e. which replica group is responsible for a given key)
    • We might want to reconfigure the system (inducing redistribution of shards)
      • Add/Remove Paxos Replica Group
      • Move a shard to another group (testing or, in practice, load balancing popular keys)
  • Conceptually similar to the View Server in primary/backup

12 of 78

ShardMaster continued

  • Keeps track of a current configuration object (ShardConfig):
    • private final int configNum;
    • private final Map<Integer, Pair<Set<Address>, Set<Integer>>> groupInfo;
      • Integer: group id
      • Set<Address>: addresses of all members in that group
      • Set<Integer>: all the shard numbers the group is responsible for
  • Also remembers all old configurations
    • Does not need to be garbage collected
    • Query can ask for any past configurations (see slide 15)
    • For every configuration number, want to store a configuration object like above

13 of 78

ShardMaster Application

  • ShardMaster class is an Application
  • Accepts 4 command types:
    • Join
    • Leave
    • Move
    • Query
  • Responds with 3 reply types:
    • Ok
    • Error
    • ShardConfig
  • You’ll only need to call Query commands for the other parts of the lab (test code calls others for you)

14 of 78

Join

  • The way that new replica groups are added to the system
    • First Join’s config num should be INITIAL_CONFIG_NUM (0)
  • Join commands contain:
    • Integer for replica group ID (Must be unique, or ERROR is the result)
    • Set of server addresses that should be in the group
  • ShardMaster responds by creating a new configuration
    • New config includes new shard group
    • Redistributes the shards among the updated set of groups.
      • Should move as few shards as possible ← this may be a little bit tricky.
    • Returns Ok result

15 of 78

Leave

  • Command contains: Group Id that should “leave”
  • Opposite of Join: “deletes” a group from the system
  • ShardMaster must redistribute the group’s shards to other groups
  • Should still move as few shards as possible
  • OK on success, ERROR when
    • the current config does not contain group or
    • the final group is trying to leave (not actually tested)

16 of 78

Move

  • Command contains:
    • Shard number
    • Replica Group id (which group the shard should be moved to)
  • Moves a shard from one Paxos Replica Group to another Paxos Replica Group
  • Practically, helpful for load balancing in the real world - operations on really hot keys perhaps should be more isolated than other keys!
  • Returns OK on successful move, ERROR otherwise (e.g. when the current config does not contain the group or if it already has the shard)

17 of 78

Query

  • Command contains: configuration number
  • Should reply with ShardConfig
  • Returns configuration for a specific configuration number
    • e.g. a server is outdated and needs to catch up on all the missed configurations
  • If number is -1 or larger than largest known configuration number, the ShardMaster should reply with the latest configuration.

18 of 78

Join Example

Shard

1

2

3

4

5

6

7

8

9

10

Group

null

null

null

null

null

null

null

null

null

null

Configuration: -1 [should return Error if queried]

Join(1)

Worksheet part 1

19 of 78

Join Example

Shard

1

2

3

4

5

6

7

8

9

10

Group

1

1

1

1

1

1

1

1

1

1

Configuration: 0

Join(2)

20 of 78

Join Example

Shard

1

2

3

4

5

6

7

8

9

10

Group

1

1

1

1

1

2

2

2

2

2

Configuration: 1

Join(5)

21 of 78

Join Example

Shard

1

2

3

4

5

6

7

8

9

10

Group

1

1

1

5

5

2

2

2

2

5

Configuration: 2

22 of 78

Some series of transitions occur...

23 of 78

Leave Example

Shard

1

2

3

4

5

6

7

8

9

10

Group

1

1

5

2

2

7

7

4

6

6

Configuration: 9

Leave(1)

24 of 78

Leave Example

Shard

1

2

3

4

5

6

7

8

9

10

Group

5

4

5

2

2

7

7

4

6

6

Configuration: 10

25 of 78

Some tips

  • Remember to make deep copies of configurations if you’re going to edit them to create the next configurations

26 of 78

Lab 4 Part 2 - Replication for Sharded K/V

27 of 78

Read the spec!

28 of 78

Part 2 in some bullet points

  • Replicate messages
  • Move shards around to transition into new configs
  • Handle requests

29 of 78

As a note, just because you pass the part 1 tests, it doesn’t mean that the ShardMaster is 100% correct. ��100% code coverage is hard.

30 of 78

Overview: System

Paxos Replica Group 1

ShardStoreServer

ShardStoreServer

ShardStoreServer

Paxos Replica Group 2

ShardStoreServer

ShardStoreServer

ShardStoreServer

Paxos Replica Group 3

ShardStoreServer

ShardStoreServer

ShardStoreServer

ShardMaster

Client

Client

Client

PaxosRequest: Please give me the configuration

Note: Query is a Command

ShardStoreRequest: Paxos Replica Group 2, please append(“cat”, “concatenate files and print to stdOut”)

This is running PaxosServers with a ShardMaster app

1A

1B

2: “Client requests” -> shard’s group of servers

3: Paxos within group

4: Servers will execute eventually after consensus + reply

31 of 78

Overview: ShardMaster

Paxos Group Responsible for ShardMaster

P

P

P

P

P

ShardMaster is an Application that is running within a PaxosServer:

new PaxosServer(address, shardMasters.clone(), new ShardMaster(numShards));

32 of 78

Overview: Paxos Replica Group

Paxos Replica Group

ShardStoreServer

ShardStoreServer

ShardStoreServer

P

P

P

A zoomed in version of a Paxos Replica Group, where P stands for Paxos Subnode. Lines between ShardStoreServer and its P subnode indicate message passing between the two.

33 of 78

Message Passing between ShardStoreServer and PaxosSubnode

ShardStoreServer

PaxosSubnode

PaxosRequest

PaxosDecision

When executing, pass up PaxosDecision using handleMessage(decision, supernodeAddress) instead of doing app.execute. [Still in order.]

decision = what to process next on ShardStoreServer, e.g. reconfig or put or something

(Almost) everything should be Paxos replicated, so wrap stuff in a PaxosRequest and call handleMessage(request, paxosSubnode)

If leader, propose to other Paxos nodes

Gets decision or learns it’s chosen

34 of 78

Overview: Paxos Replica Group’s Subnode

public void init() {

// Setup Paxos

paxosAddress = Address.subAddress(address(), PAXOS_ADDRESS_ID);

Address[] paxosAddresses = new Address[group.length];

for (int i = 0; i < paxosAddresses.length; i++) {

paxosAddresses[i] = Address.subAddress(group[i], PAXOS_ADDRESS_ID);

}

PaxosServer paxosServer =

new PaxosServer(paxosAddress, paxosAddresses, address());

addSubNode(paxosServer);

paxosServer.init();

...

}

Init for ShardStoreServer

Code is also in spec

This won’t work until you add the constructor, see next slide

35 of 78

Constructing the Paxos Sub Node

public PaxosServer(Address address, Address[] servers, Address parentAddress) {

super(address); // 'address' is the address of this node

this.servers = servers;

this.parentAddress = parentAddress;� // 'parentAddress' is the address of the 'parent' ShardStoreServer

// Again, just call handleMessage(decision, this.parentAddress);

// Note: There is no app.

}

36 of 78

Changes for PaxosServer

  • Add the constructor
  • Basically add handleMessage(PaxosDecision, parentAddress) everywhere that executes
  • You might want to be able to repropose Query commands (or you can perform stale reads for Query, but ensure that you only perform stale reads if the sender allows it – we will cover this next week).

37 of 78

ShardStoreClient

  • Sends requests to ShardStoreServers
  • Similar to your client from lab2 (Primary/Backup/Viewserver)
  • Needs to get configuration from ShardMaster to be able to know to where to send requests
    • Tip: Use the existing broadcastToShardMasters(...) defined in ShardStoreNode to broadcast a Query to get current configuration
  • Response from shard masters will be a PaxosReply → need a handlePaxosReply in client to get updated ShardConfig
  • If client times out or receives an error message → client out of date, need to get updated config
  • For individual KVStoreCommands [Puts, Gets, Appends], wrap them in a ShardStoreRequest and broadcast to all servers in the correct replica group (maybe define a getGroupIdForShard(...) method in the config and a keyToShard(...)* method is provided for you to figure out which servers to broadcast to)

38 of 78

Messages

  • In ShardStoreServer, you receive:
    • PaxosReplys from the ShardMaster, informing you about new configs.
      • You will be pinging the ShardMaster every so often for new configurations (via Query)
    • ShardStoreRequests from clients asking you to perform a SingleKeyCommand (The commands you are used to in Lab 2/Lab 3) or Transaction (Lab 4 part 3)
    • ShardMoves/ShardMoveAcks* from other ShardStoreServers
    • PaxosDecisions from the Paxos subnode - the replicated messages
  • When a ShardStoreServer has received one of these messages, it CANNOT act on them until they are replicated (unless app has executed already, in which case it’s still safe to reply, as before)
    • All ShardMoves and SingleKeyCommands need to be serialized on each shard, despite failures

39 of 78

Suggestion for implementation

  • Commands received by a ShardStoreServer will usually need to be sent to Paxos to get a consensus
    • Basic structure will look something like this:

handleShardStoreRequest(Request m) {

// Do some validation (e.g. is the request being sent to the correct group, etc...)

// If valid, call process (coming in a slide or two) with replicated as false

}

  • Since this can can get a bit difficult to keep straight, we recommend the following approach instead...

40 of 78

Suggestion for implementation (continued)

  • Have a single generic “process” method that takes (Command c, boolean replicated)
    • Process method does “instanceof” on the command and calls a specific processXX where XX is the specific command type (e.g. AMOCommand, ShardMove, etc...)
    • When you receive a Paxos decision, just send it into this switch table, which will send it to the appropriate processXX method. See the next slide.
  • Each message handler will only call process method
    • E.g. handleShardStoreRequest(ShardStoreRequest r, ...) calls process with the message’s command and false for replicated (since command has not been replicated by Paxos)
    • On PaxosDecision handler, also call process with command but replicated would be true instead
  • Then in the individual processXX methods you can do something like…

41 of 78

Suggestion for implementation (continued)

private void process(Command command, boolean replicated) {

if (command instanceof ShardMove) {

processShardMove((ShardMove) command, replicated);

} else if (command instanceof ShardMoveAck) {

processShardMoveAck((ShardMoveAck) command, replicated);

} else if (command instanceof NewConfig) {

processNewConfig((NewConfig) command, replicated);

} else if (command instanceof AMOCommand) {

processAMOCommand(command, replicated);

}

// Add cases for Lab 4 Part 3

else {

LOG.severe("Got unknown command: " + command);

}

}

private void processShardMove(ShardMove m, boolean replicated) {

// some checks to see if this is valid command

if (!replicated) {

// propose via paxos. process will call this method again

// with replicated=true from handlePaxosDecision

// Propose m to Paxos subnode (Some wrapping may be required)

return;

}

// the message was replicated can now act upon it

// do stuff here

}

private void handleShardMoveMessage(ShardMoveMessage m, Address sender) {

process(m.command(), false /* It’s not replicated yet… */ );

}

private void handlePaxosDecision(PaxosDecision p, Address sender) {

process(p.command(), true /*It’s replicated (from Paxos) */ );

}

42 of 78

Suggestion for implementation (continued)

processXX(Message m, Address a, boolean replicated) {

// ...

// What if this command has been executed?

// What if this is not an active Config?

// ...

if (!replicated) {

paxosPropose(m.command());

return;

}

// Yay, it’s replicated! Now we can act upon the message, here...

}

Idea: For a given message, you will go into the processXX method twice (once before the message is replicated (in which case you will propose it via Paxos and then return) and the second time after the message has already been replicated via Paxos

43 of 78

Reconfiguration and More

44 of 78

Overview: System

Paxos Replica Group 1

ShardStoreServer

ShardStoreServer

ShardStoreServer

Paxos Replica Group 2

ShardStoreServer

ShardStoreServer

ShardStoreServer

Paxos Replica Group 3

ShardStoreServer

ShardStoreServer

ShardStoreServer

ShardMaster

Client

Client

Client

PaxosRequest: Please give me the configuration

Note: Query is a Command

PaxosRequest: Please give me the most recent configuration

Note: Query is a Command

configNum 5

G1: shard 1, 2, 3

G2: shard 4, 5, 6

configNum 6

G1: shard 1, 2

G2: shard 4, 5

G3: shard 3, 6

45 of 78

Reconfiguration

  • ShardStoreServer should not move on when it’s transitioning between configurations. Think back to state transitioning in PBServer
    • This is also true about reconfigurations, you don’t want to process any key/transaction commands until you have an active configuration!
  • How do you make sure all ShardStoreServer in a Paxos Group know about the new config?
    • Only one ShardStoreServer might hear about the new config-> but we need all nodes to know about it-> Use Paxos for consensus!
    • You don’t want a single ShardStoreServer to be able to send a Shard or receive (acknowledge) a Shard by itself - it must perform Paxos consensus

46 of 78

What does a configuration need to move?

  • From the spec:

Be careful about when guaranteeing at-most-once semantics for key-value operations. When a server sends shards to another, the server needs to send AMOApplication state as well. Think about how the receiver of the shards should send its AMOApplication state.

  • Our KVStore AMOApplication is more than just a KVStore - it also needs <Client, SeqNum> mappings so that it doesn’t repeat operations!
  • Solution 1: In ShardMove, send your Map<Client, SeqNum> mappings and keys associated with the moved shards. In handleShardMove, just update your reply mappings for that client to the maximum SeqNum.
  • Solution 2: Store one AMOApplication per shard to make moving shards easier in a �Map<Integer, AMOApplication>

47 of 78

Why Can’t This Happen?

Config 10

Group 1: shards 1,2,3

Group 2: shards 4,5,6

Group 3: shards 7,8,9,10

Config 11

Group 1: shards 2,3,4 (sent shard 1, received shard 4)

Group 2: shards 5,6 (sent shard 4)

Group 3: shards 8,9,10 (sent shard 7)

Group 4: shards 1,7 (received shards 1 and 7)

48 of 78

There is an example of what can go wrong on the slides

Servers should not query with (-1)

Instead, make sure your servers query with (config num + 1)

49 of 78

Reconfiguration

Might be (slightly) different if you’re doing transactions (part 3)!

Group 1

(0, [1,2,3,4], [], {}, false)

Group 2

(0, [], [], {}, false)

ShardMasters

Note that messages need to be replicated for the state to be consistent among the servers in the groups.

(config num, shards owned, shards needed, shards to move, inReconfig)

Note: Shards created in first config

50 of 78

Reconfiguration

Might be (slightly) different if you’re doing transactions (part 3)!

Group 1

(0, [1,2,3,4], [], {}, false)

Group 2

(0, [], [], {}, false)

ShardMasters

(config num, shards owned, shards needed, shards to move, inReconfig)

Query(1)

51 of 78

Reconfiguration

Might be (slightly) different if you’re doing transactions (part 3)!

Group 1

(0, [1,2,3,4], [], {}, false)

Group 2

(0, [], [], {}, false)

ShardMasters

NewConfig

(Group 1, 1, {1:[1,2], 2:[3,4]})

(config num, shards owned, shards needed, shards to move, inReconfig)

52 of 78

Reconfiguration

Might be (slightly) different if you’re doing transactions (part 3)!

Group 1

(1, [1,2], [], {2:{3,4}}, true)

Group 2

(0, [], [], {}, false)

ShardMasters

(config num, shards owned, shards needed, shards to move, inReconfig)

53 of 78

Reconfiguration

Might be (slightly) different if you’re doing transactions (part 3)!

Group 1

(1, [1,2], [], {2:{3,4}}, true)

Group 2

(0, [], [], {}, false)

ShardMasters

(config num, shards owned, shards needed, shards to move, inReconfig)

ShardMove(Group 2, 1, {3,4})

54 of 78

Reconfiguration

Might be (slightly) different if you’re doing transactions (part 3)!

Group 1

(1, [1,2], [], {2:{3,4}}, true)

Group 2

(0, [], [], {}, false)

ShardMasters

(config num, shards owned, shards needed, shards to move, inReconfig)

55 of 78

Reconfiguration

Might be (slightly) different if you’re doing transactions (part 3)!

Group 1

(1, [1,2], [], {2:{3,4}}, true)

Group 2

(0, [], [], {}, false)

ShardMasters

(config num, shards owned, shards needed, shards to move, inReconfig)

Query(1)

56 of 78

Reconfiguration

Might be (slightly) different if you’re doing transactions (part 3)!

Group 2

(0, [], [], {}, false)

ShardMasters

NewConfig

(Group 2, 1, {1:[1,2], 2:[3,4]})

(config num, shards owned, shards needed, shards to move, inReconfig)

Group 1

(1, [1,2], [], {2:{3,4}}, true)

57 of 78

Reconfiguration

Might be (slightly) different if you’re doing transactions (part 3)!

Group 2

(1, [], [3,4], {}, true)

(config num, shards owned, shards needed, shards to move, inReconfig)

ShardMasters

Group 1

(1, [1,2], [], {2:{3,4}}, true)

58 of 78

Reconfiguration

Might be (slightly) different if you’re doing transactions (part 3)!

Group 2

(1, [], [3,4], {}, true)

(config num, shards owned, shards needed, shards to move, inReconfig)

ShardMasters

Group 1

(1, [1,2], [], {2:{3,4}}, true)

ShardMove(Group 2, 1, {3,4})

onShardMoveTimer(1, {2:{3,4}})

59 of 78

Reconfiguration

Might be (slightly) different if you’re doing transactions (part 3)!

Group 2

(1, [3, 4], [], {}, false)

(config num, shards owned, shards needed, shards to move, inReconfig)

ShardMasters

Group 1

(1, [1,2], [], {2:{3,4}}, true)

60 of 78

Reconfiguration

Might be (slightly) different if you’re doing transactions (part 3)!

Group 2

(1, [3, 4], [], {}, false)

(config num, shards owned, shards needed, shards to move, inReconfig)

ShardMasters

Group 1

(1, [1,2], [], {2:{3,4}}, true)

ShardMoveAck(Group 1, 1, Group 2)

61 of 78

Reconfiguration

Might be (slightly) different if you’re doing transactions (part 3)!

Group 2

(1, [3, 4], [], {}, false)

(config num, shards owned, shards needed, shards to move, inReconfig)

ShardMasters

Group 1

(1, [1,2], [], {}, false)

62 of 78

Reconfig Deadlock-ish Problem

ShardMasters

Group 1

(1, [1,2], [], {}, false)

Group 2

(1, [3,4], [], {}, false)

(config num, shards owned, shards needed, shards to move, inReconfig)

Query(-1)

63 of 78

Reconfig Deadlock-ish Problem

ShardMasters

Group 1

(1, [1,2], [], {}, false)

Group 2

(1, [3,4], [], {}, false)

(config num, shards owned, shards needed, shards to move, inReconfig)

NewConfig

(Group 1, 2,

{1:[1,2,3], 2:[4]})

64 of 78

Reconfig Deadlock-ish Problem

ShardMasters

Group 1

(2, [1,2], [3], {}, true)

Group 2

(1, [3,4], [], {}, false)

(config num, shards owned, shards needed, shards to move, inReconfig)

65 of 78

Reconfig Deadlock-ish Problem

ShardMasters

Group 1

(2, [1,2], [3], {}, true)

Group 2

(1, [3,4], [], {}, false)

(config num, shards owned, shards needed, shards to move, inReconfig)

Query(-1)

66 of 78

Reconfig Deadlock-ish Problem

ShardMasters

Group 1

(2, [1,2], [3], {}, true)

Group 2

(1, [3,4], [], {}, false)

(config num, shards owned, shards needed, shards to move, inReconfig)

NewConfig

(Group 2, 3,

{1:[1,2], 2:[3,4]})

67 of 78

Reconfig Deadlock-ish Problem

ShardMasters

Group 1

(2, [1,2], [3], {}, true)

Group 2

(3, [3,4], [], {}, false)

(config num, shards owned, shards needed, shards to move, inReconfig)

Stuck in reconfig because Group 2 won’t send it shard 3 now :(

68 of 78

Reconfig Deadlock-ish Problem

ShardMasters

Group 1

(2, [1,2], [3], {}, true)

Group 2

(3, [3,4], [], {}, false)

(config num, shards owned, shards needed, shards to move, inReconfig)

Solution: servers should query with (config num + 1) rather than (-1)

69 of 78

Thus, Clients and Servers should only accept increasing configuration numbers

Clients want the most up-to-date state: Query(-1)

Servers want the state for the next configuration that they can transition to, if there is one

  • Query for the next configuration: Query(currentConfigNum + 1)

70 of 78

Reconfiguration: A Potential Workflow

  1. New config received from a PaxosReply from pinging the ShardMaster group
  2. Replicate this into the paxos log (if !replicated { propose; }), and then continue when this operation is replicated
    1. You will have to create a new command to wrap the config in to pass it to Paxos
  3. Send Shards to the Paxos Replica Group responsible for the Shard in the new configuration
    • Retry sending these shards until you receive an Ack (also need to replicate Ack received in paxos log). Don’t resend to groups for which you have received an Ack
    • The shard group receiving the ShardMove request should put it in its log to replicate it. Then it can respond with an Ack.* What if the shard comes from the future, i.e. with a higher configNum than what is currently known?
  4. Receive Shards from the Paxos Replica Group responsible for your new Shards in the new configuration
  5. Wait until ShardAcksNeeded and ShardMovesNeeded are both empty/0
    • Helpful utility method: Given a groupId, an old Config, and a new Config, what shards need to be moved? This can be used for both finding out what ShardMoves and ShardMoveAcks you need to wait to receive
  6. Apply the ShardMove operations you have received when you are moving to a new config.
    • KVStore data updates and <Client, SeqNum> AMOApp updates

71 of 78

Maybe: Storing Heard Requests during a Reconfig

...

NewConfig

Command C

ShardMove

Command C

Gets rejected because it’s in a reconfig

Enters reconfig

Gets rejected because it’s already in the log*

Finished reconfig

Consider the case when

Maybe try to store replicated Commands you receive while you’re doing a reconfig and iterating over them when you’re done with the reconfig. (Make sure they get processed in the same order.)

*=would be reproposed after garbage collection

72 of 78

Sending messages between groups

  • You can tag shard moves and shard move acks with groupId information or the set of addresses to figure out who to send the result to, makes it easier if you get a lot of messages from older configurations

73 of 78

Misc. Tips

  • If you don’t pass 16,17,18 from Lab 3, you might want to consider going back to fix it if you’re failing tests 2.6-2.9 on Lab 4
  • Every operation (such as reconfig) that still happens if the current node fails needs to be adopted by Paxos; otherwise can lead to inconsistent state on the servers.
  • Each of the ShardStoreServers in a group can start the timer at a different time, so if you change state based on a timer, then the ShardStoreServers might diverge in terms of behavior
  • You need to define PaxosDecision in its own file so that it can be accessed by other packages

74 of 78

Some spec clarifications

“You may have implemented optimizations in lab 3 by making assumptions which were valid but do not hold for this lab. In particular, you should be very cautious about dropping proposals when Paxos is running as a sub-node. As a sub-node, Paxos should be oblivious to AMOApplication logic and should be able to decide same command for different slots. Some de-duplication at the PaxosServer level is possible, but it must be done carefully.”

  • You don’t have an app anymore, so your already executed checks might throw an error
  • Some filtering should be done by the ShardStoreServer before being passed down to the Paxos subnode, e.g.
    • already executed
    • whether the server is performing a reconfiguration
  • Largely depends on implementation
  • What happens to a client request that gets paxos replicated right before a decision is made and processed for a reconfig?

75 of 78

A note on read-only commands

  • IMPORTANT: Clients send Query(-1) over and over but want new answers each time.
    • Option 1: give each Query(-1) message a distinct sequence number (may blow up search tests)
    • Option 2: hack a way for some read only commands not to need a sequence number, and to be "never duplicated" (see speaker notes)
    • Option 3: use non-linearizable/stale results (may work ok if you check the config number on the response carefully)
  • You can reply to read-only commands without taking a log slot if you do it carefully: see Section 4.5 in PMMC
    • Issue: how to make sure the read sees an up-to-date chosen value, as of when the client issued the request (or later) but before the reply to the client; higher than any in progress op
    • Solutions:
      • Poll the acceptors for the slot with highest accepted value; do the read after that slot is decided
      • Another: give the leader a lease - prevent another leader from updating concurrently
      • Another: hold the read until a new write request can be filled, and do the read before that write - since that confirms that you are still the (only) leader after the read operation starts
    • But we suggest not doing this, because it’s tricky to get right

76 of 78

Remember to remove the apps from the shards when you move them

Why? Consider when you move a shard and don’t remove the shard, then in a future configuration, you’re supposed to get the shard back. Your code might indicate that it doesn’t need to be transferred because you already have it.

77 of 78

Using the visualizer

See the bottom of the lab 4 spec for more details

./run-tests.py -l 4 -d NUM_GROUPS NUM_SERVERS_PER_GROUP NUM_SHARDMASTERS NUM_CLIENTS CLIENT_WORKLOAD [... CLIENT_N_WORKLOAD] [CONFIG_WORKLOAD]

78 of 78

Additional notes [not mentioned in spec]

  • configController which handles all of the configuration changes like Joins/Leaves/Moves is a PaxosClient