1 of 45

CS 186 Exam Prep 12

Consensus

2 of 45

Motivation for Consensus

  • Replication of data is great
    • Higher availability
    • Geographical locality
  • Replication of data is difficult
    • How to ensure consistent replicas?
    • How to handle crashes?

3 of 45

Consensus Setup

2

6

1

3

2

2

2

2

  • Nodes each propose one value
  • System must make sure node agree on one value
  • Why not always use one proposer’s value?
    • Single point of failure

4 of 45

Paxos System Model

  • System contains N nodes (aka servers)
  • Nodes communicate only by passing messages
  • Messages can take arbitrarily long to be delivered
    • Messages are not corrupted (see BFT Consensus for this)
  • Nodes fail by stopping
  • Paxos can achieve consensus for F failures if N >= 2F + 1

5 of 45

Requirements

  • Safety
    • “Nothing bad happens”
    • Only a value that has been proposed is chosen
    • Only a single value is chosen
  • Liveness
    • “Something good eventually happens”
    • Replicas eventually converge to an agreed upon value
  • If N < 2F + 1 then no guarantees
    • Too many failures for Paxos to properly work

6 of 45

Paxos Algorithm

  • Three phase algorithm in the ideal case
    • Phase 1: Elect Leader
    • Phase 2: Propose Ballot to Be Chosen
    • Phase 3: Broadcast Chosen Ballot’s Value
  • May take more than three phases if multiple nodes want to be leader

7 of 45

Paxos Intuition

  • If we satisfy Phase 2 (P2), then we can achieve safety
    • P2: If a ballot with value v is chosen, every higher number ballot that is chosen also has value v
    • P2_a: If a ballot with value v is chosen, every higher ballot id accepted by a participant has value v
    • P2_b: If a ballot with value v is chosen, every ballot with a higher ballot id issued by a proposer has value v
    • P2_c: For any ballot issued with id n and value v, there is a majority of participants such that either no participant has accepted any ballot id numbered less than n or v is the value of the highest-numbered ballot id accepted by any participant in the majority�
  • P2_c => P2_b => P2_a => P2 - just satisfy P2_c!

8 of 45

Phase 1 - Elect Leader

  • Proposer sends a PREPARE(ballot_id) message to all participants
  • If a participant has received a ballot with lower ballot_id, ignore this request
  • If a participant has already sent a ACCEPT with a lower old_ballot_id with value v, respond with a PROMISE(ballot_id, (old_ballot_id, v))
    • Necessary to enforce the invariant that any later ballot has the value of any already chosen ballot
  • If this is the highest ballot_id seen so far, reply with PROMISE(ballot_id). Now the participant promises to not ACCEPT any lower ballot_id.

9 of 45

Phase 1 - Elect Leader

  • If a majority of participants respond with a PROMISE message, then the proposer is the leader
  • Proposer must then propose the value of the highest number ballot id received in any of the PROMISE messages
    • Can’t necessarily propose any value because a value may have already been chosen
    • If no such value exists, then the proposer can propose any value they want

10 of 45

Phase 2 - Propose Ballot to Be Chosen

  • Leader sends PROPOSE(ballot_id, v) to all participants
    • Remember that v is not necessarily the value that the proposer initially wanted. It is the value determined after receiving the PROMISE messages from a majority of participants.
  • If a participant has received a ballot with higher ballot_id, ignore this request
  • Otherwise a participant sends an ACCEPT(ballot_id, v) to proposer

11 of 45

Phase 3 - Broadcast Chosen Value

  • If the proposer received an ACCEPT(ballot_id, v) message from a majority of participants, then the value v is chosen.
    • If no such majority exists, then the round failed and a new leader was chosen.
  • Broadcast COMMIT(ballot_id, v) to all participants and finish the round
  • Core result: if a majority of participants ACCEPT ballot_id, v in a round then
    • Any subsequent round chooses v as the consensus value
    • Or the round fails

12 of 45

Simple Example

PREPARE(1)

13 of 45

Simple Example

PREPARE(1)

PROMISE(1)

14 of 45

Simple Example

PREPARE(1)

PROMISE(1)

PROPOSE(1, “CS186”)

15 of 45

Simple Example

PREPARE(1)

PROMISE(1)

PROPOSE(1, “CS186”)

ACCEPT(1, “CS186”)

16 of 45

Simple Example

PREPARE(1)

PROMISE(1)

PROPOSE(1, “CS186”)

ACCEPT(1, “CS186”)

COMMIT(1, “CS186”)

17 of 45

Conflict Example

PREPARE(1)

PROMISE(1)

PREPARE(2)

PROMISE(2)

18 of 45

Conflict Example

PREPARE(1)

PROMISE(1)

PREPARE(2)

PROPOSE(1, “CS186”)

PROMISE(2)

19 of 45

Conflict Example

PREPARE(1)

PROMISE(1)

PREPARE(2)

PROPOSE(1, “CS186”)

PROMISE(2)

PROPOSE(1, “CS186”) is ignored by participants!

PROPOSE(2, “CS286”)

20 of 45

Conflict Example

PREPARE(1)

PROMISE(1)

PREPARE(2)

PROPOSE(1, “CS186”)

PROMISE(2)

PROPOSE(2, “CS286”)

ACCEPT(2, “CS286”)

21 of 45

Conflict Example

PREPARE(1)

PROMISE(1)

PREPARE(2)

PROPOSE(1, “CS186”)

PROMISE(2)

PROPOSE(2, “CS286”)

ACCEPT(2, “CS286”)

COMMIT(2, “CS286”)

22 of 45

Conflict Example

PREPARE(1)

PROMISE(1)

PREPARE(2)

PROPOSE(1, “CS186”)

PROMISE(2)

PROPOSE(2, “CS286”)

ACCEPT(2, “CS286”)

COMMIT(2, “CS286”)

23 of 45

Livelock Example

PREPARE(1)

PROMISE(1)

PREPARE(2)

PROMISE(2)

24 of 45

Livelock Example

PREPARE(1)

PROMISE(1)

PREPARE(2)

PROPOSE(1, “CS186”)

PROMISE(2)

25 of 45

Livelock Example

PREPARE(1)

PROMISE(1)

PREPARE(2)

PROPOSE(1, “CS186”)

PROMISE(2)

PROPOSE(2, “CS286”)

26 of 45

Livelock Example

PREPARE(1)

PROMISE(1)

PREPARE(2)

PROPOSE(1, “CS186”)

PROMISE(2)

PROPOSE(2, “CS286”)

PREPARE(3)

PROMISE(3)

27 of 45

Livelock Example

PREPARE(1)

PROMISE(1)

PREPARE(2)

PROPOSE(1, “CS186”)

PROMISE(2)

PROPOSE(2, “CS286”)

PREPARE(3)

PROMISE(3)

PROPOSE(3, “CS262A”)

28 of 45

Livelock Example

PREPARE(1)

PROMISE(1)

PREPARE(2)

PROPOSE(1, “CS186”)

PROMISE(2)

PROPOSE(2, “CS286”)

PREPARE(3)

PROMISE(3)

PROPOSE(3, “CS262A”)

PREPARE(4)

PROMISE(4)

And so on and so forth…

29 of 45

Fixing Livelock

  • Livelock arises when multiple proposers ping-pong their requests so that no progress is achieved even though many messages are sent
  • Dedicate a fixed leader to be the only node proposing values
    • Elect a leader either through Paxos or through another algorithm
    • Elect new leader when leader fails
    • Leads to higher performance since leader election phase of Paxos can be skipped
      • See Multi-Paxos for more

30 of 45

Paxos Logging

  • Participants log the ballot id and flush the log when they PROMISE
    • Necessary to keep track of highest ballot id it has promised to ignore lower ballots for
  • Participants log the ballot and flush the log when they ACCEPT
    • Necessary since a participant must keep track of the highest ballot it has ACCEPTed if it crashes and comes back online
  • Proposers log the ballot and flush the log when a majority of participants have sent an ACCEPT
    • Remember the chosen value

31 of 45

Why Does N >= 2F + 1 Work?

  • Any majority contains at least F + 1 nodes
  • Thus if F nodes fail, at least 1 correct node will remain for that majority
  • This node will propagate the chosen value when it sends PROMISE messages to other nodes in later rounds
  • Those proposers will then be forced to propose this chosen value

32 of 45

Why Does N >= 2F + 1 Work?

Majority in round that leads to a chosen value

33 of 45

Why Does N >= 2F + 1 Work?

Suppose F=2 nodes crash after consensus has been achieved

34 of 45

Why Does N >= 2F + 1 Work?

In the next round, one node from the previous majority will be in the new majority

This node will respond with PROMISE(ballot_id, (old_ballot_id, v))

35 of 45

Other Ways to Learn Chosen Values

  • Previously, described setup where proposer broadcasts a COMMIT message to all participants after its value is chosen
    • Pros: Requires O(N) messages to be sent with N participants
    • Cons: Low reliability and requires an extra round of communication
  • Alternatively, participants can broadcast to all other participants whenever it ACCEPTs a value
    • Pros: Requires one less round of communication and more reliable
    • Cons: Requires O(N^2) messages to be sent with N participants

36 of 45

COMMIT Example

PREPARE(1)

PROMISE(1)

PROPOSE(1, “CS186”)

ACCEPT(1, “CS186”)

COMMIT(1, “CS186”)

37 of 45

Broadcast ACCEPT Example

PREPARE(1)

PROMISE(1)

PROPOSE(1, “CS186”)

ACCEPT(1, “CS186”)

38 of 45

Replicated Databases

a

b

3

0

4

1

5

0

6

1

7

2

8

3

a

b

3

0

4

1

5

0

6

1

7

2

8

3

Partition into Tablets

39 of 45

a

b

3

0

4

1

5

0

6

1

7

2

8

3

Partition into Tablets

a

b

3

0

4

1

5

0

a

b

6

1

7

2

8

3

40 of 45

a

b

3

0

4

1

5

0

a

b

3

0

4

1

5

0

a

b

3

0

4

1

5

0

a

b

3

0

4

1

5

0

Replicate Tablet

US-EAST

US-WEST

EU-WEST

41 of 45

a

b

3

0

4

1

5

0

a

b

3

0

4

1

5

0

a

b

3

0

4

1

5

0

a

b

6

1

7

2

8

3

a

b

6

1

7

2

8

3

a

b

6

1

7

2

8

3

US-EAST

US-WEST

EU-WEST

Paxos group

Paxos group

42 of 45

Local Reads

A read for a specific tablet’s data can go to the nearest tablet’s replica

43 of 45

Replicated Writes

A write for a specific tablet initiates a Paxos protocol within the tablet’s Paxos replica group

Typically a distinguished leader begins the protocol. Clients send requests to the leader.

44 of 45

Worksheet

45 of 45

Attendance Link