1 of 82

Raft and Distributed Consensus

Michael

1

2 of 82

A Stateful Service

2

3 of 82

3

client

x:4

x←4

4 of 82

4

client

x:4

ok

5 of 82

High Availability

5

6 of 82

6

S

client 1

M

S

client 2

7 of 82

Challenges

7

8 of 82

Leader Who?

8

S

client 1

M

S

client 2

9 of 82

Brain Split

9

S

client 1

M

S

client 2

10 of 82

Raft

10

11 of 82

11

2

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

1

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

client 2

3

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

client 1

12 of 82

Three Parts of Raft

  • Leader Election
    • Raft is a leader based protocol.
    • How to pick a leader.
    • How to recover from crashes, choose a new leader.
  • Log Replication
    • Leader accepts client command, append to its own log.
    • Leader replicates log to all followers.
  • Safety
    • Restriction on Candidacy, only vote for candidate “better” than me.
    • When can we consider a log entry as committed?
    • Why is committed an important concept

12

13 of 82

Leader Election

13

14 of 82

Server States

14

Follower

Candidate

Leader

win election

no heartbeat

start/recover

discover higher term

randomized timeout

discover current leader or new term

passive but expects a leader

request_vote RPC

append_entries RPC

15 of 82

Terms/Quorum(Majority Vote)

15

Raft divide time into Terms

16 of 82

Terms

16

Each Term starts with an election

17 of 82

Terms

17

If election timeout, it is fine, just wait for a new one to start.

18 of 82

Election Algorithm

18

current_term += 1

vote for self

ask for vote from all peers

19 of 82

Election Outcome 1

19

20 of 82

Simple Election

20

F

F

F

5 ms

T=0

150 ms

T=0

50 ms

T=0

21 of 82

Simple Election

21

F

C

F

T=1

150 ms

T=0

50 ms

T=0

22 of 82

Simple Election

22

F

C

F

T=1

32 ms

T=1

172 ms

T=1

ok

ok

23 of 82

Simple Election

23

F

L

F

T=1

121 ms

T=1

188 ms

T=1

24 of 82

Election Outcome 2

24

25 of 82

Learn About the Winner

25

F

F

F

5 ms

T=0

4 ms

T=0

231 ms

T=0

26 of 82

Learn About the Winner

26

F

C

C

112 ms

T=1

341 ms

T=1

226 ms

T=0

vote for the first request

27 of 82

Learn About the Winner

27

F

L

C

112 ms

T=1

341 ms

T=1

183 ms

T=1

no, I have voted for other node

ok

28 of 82

Learn About the Winner

28

F

L

C

112 ms

T=1

341 ms

T=1

183 ms

T=1

29 of 82

Learn About the Winner

29

F

L

F

T=1

177 ms

T=1

214 ms

T=1

30 of 82

Election Outcome 3

30

31 of 82

Split Vote

31

F

F

F

5 ms

T=0

4 ms

T=0

7 ms

T=0

32 of 82

Split Vote

32

C

C

C

217 ms

T=1

251 ms

T=1

152 ms

T=1

33 of 82

Split Vote

33

C

C

C

52 ms

T=1

242 ms

T=1

177 ms

T=1

34 of 82

Split Vote

34

C

C

C

T=2

190 ms

T=1

125 ms

T=1

35 of 82

Split Vote

35

F

L

F

T=2

153 ms

T=2

225 ms

T=2

ok

ok

36 of 82

Election Correctness

  • safe
    • bad thing will not happen
    • at most one leader will be elected for each term
  • liveness
    • good things will eventually happen
    • randomness makes it very unlikely but not impossible to split vote

36

37 of 82

Log Replication

37

38 of 82

Log Entry

38

3

x←4

4

index

term

cmd

39 of 82

Log Structure

39

server 1:

server 2:

server 3:

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

3

x←3

8

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

2

y←x

5

leader

40 of 82

Replicate Log with AppendEntris Consistency Check

40

41 of 82

Replicate Log

41

server 1:

server 2:

server 3:

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

3

x←3

8

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

2

y←x

5

match

leader

42 of 82

Replicate Log

42

server 1:

server 2:

server 3:

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

3

x←3

8

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

2

y←x

5

match

3

x←3

8

leader

43 of 82

Replicate Log

43

3

x←3

8

server 1:

server 2:

server 3:

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

3

x←3

8

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

2

y←x

5

✘mismatch

leader

44 of 82

Replicate Log

44

3

x←3

8

server 1:

server 2:

server 3:

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

3

x←3

8

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

2

y←x

5

✘mismatch

leader

45 of 82

Replicate Log

45

3

x←3

8

server 1:

server 2:

server 3:

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

3

x←3

8

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

2

y←x

5

✘mismatch

leader

46 of 82

Replicate Log

46

3

x←3

8

server 1:

server 2:

server 3:

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

3

x←3

8

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

2

y←x

5

match

leader

47 of 82

Replicate Log

47

3

x←3

8

server 1:

server 2:

server 3:

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

3

x←3

8

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

match

3

x←3

8

3

y←7

5

3

x←0

6

3

y←9

7

leader

48 of 82

Committed Entries

48

49 of 82

Committed Entries

49

server 1:

server 2:

server 3:

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

3

x←3

8

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

2

y←x

5

leader

50 of 82

Committed Entries

50

server 1:

server 2:

server 3:

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

3

x←3

8

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

3

y←7

5

3

x←0

6

3

y←9

7

1

x←2

1

1

y←3

2

1

x←4

3

2

z←x

4

2

y←x

5

leader

  1. Leader’s term.
  2. Replicated to majority by the leader.

51 of 82

Safety

51

52 of 82

Election Restriction

52

53 of 82

53

54 of 82

54

Reject

55 of 82

One Edge Case

55

56 of 82

Fig 8

56

because S1 is at T4, it replicates the past term entry in its log at T2. It has not replicated the current term T4 entry successfully yet.

Remember committed(recompuated and start from 0 whenever a new leader is elected) needs two requirements:

  • current term
  • replicated to majority

57 of 82

Leader Completeness Property

57

58 of 82

Fig 9

58

59 of 82

Client Interaction

59

60 of 82

60

F

client 1

L

F

client 2

x←4

z←x

61 of 82

61

2

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

1

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

client

3

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

z←x

62 of 82

62

2

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

1

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

client

3

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

z←x

z←x

63 of 82

63

2

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

1

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

client

3

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

z←x

z←x

64 of 82

64

2

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

1

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

client

3

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

z←x

z←x

z←x

z←x

65 of 82

65

2

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

1

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

client

3

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

z←x

z←x

z←x

z←x

66 of 82

66

2

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

1

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

client

3

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

z←x

z←x

z←x

ok, 4

67 of 82

67

2

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

1

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

client

3

x←2

y←3

x←4

Log

Consensus

Module

App State

Machine

z←x

z←x

z←x

68 of 82

Run Away Example

68

69 of 82

Runaway Example

69

F

L

F

F

F

70 of 82

Case 1

70

71 of 82

Runaway Example: Case 1

71

F

F

F

L

F

72 of 82

Runaway Example: Case 1

72

F

L

F

L

F

73 of 82

Runaway Example: Case 1

73

F

L

F

L

F

who has a higher term ?

74 of 82

Case 2

74

75 of 82

Runaway Example: case 2

75

F

L

F

F

F

76 of 82

Runaway Example: case 2

76

F

L

F

C

C

Term 82

Term 3

77 of 82

Runaway Example: case 2

77

F

L

F

C

C

Term 82

Term 82

78 of 82

Runaway Example: case 2

78

F

F

F

C

C

Term 82

Term 82

vote for me, please

79 of 82

Runaway Example: case 2

79

F

F

F

C

C

Term 82

Term 82

reject, because log

80 of 82

Runaway Example: case 2

80

C

F

F

C

C

Term 82

Term 83

81 of 82

Runaway Example: case 2

81

L

F

F

F

F

Term 83

Term 83

ok

ok

ok

ok

82 of 82

Reference

  1. Designing for Understandability: The Raft Consensus Algorithm - YouTube
    1. Raft States 38:57
  2. Distributed Consensus with Raft - CodeConf 2016 - YouTube
    • The teaser problem
  3. Practical Distributed Consensus using HashiCorp/raft - YouTube
    • The teaser problem with many clients
  4. Lecture 6: Fault Tolerance: Raft (1) - YouTube
  5. Lecture 7: Fault Tolerance: Raft (2) - YouTube
  6. RustConf 2018 - Using Raft in Rust by Siddon Tang - YouTube
    • Examples of split brain

82