1 of 119

Distributed Mutual Exclusion and Deadlock Detection in Distributed Systems

Module 4

CET3019B 2023

2 of 119

Course Contents

3 of 119

Outline

  • What is mutual exclusion?
  • Distributed mutual exclusion algorithms
    • Token-based
    • Non-token-based
    • Quorum-based (quorum: a subset of sites)
  • Centralized mutual exclusion algorithm
  • Deadlocks in distributed systems

4 of 119

Mutual exclusion

  • Mutual exclusion is a fundamental problem in distributed computing systems.
  • Mutual exclusion ensures that concurrent access of processes to a shared resource or data is serialized, that is, executed in a mutually exclusive manner.
  • Mutual exclusion in a distributed system states that only one process is allowed to execute the critical section (CS) at any given time.
  • In a distributed system, shared variables (semaphores) or a local kernel cannot be used to implement mutual exclusion.
  • Message passing is the sole means for implementing distributed mutual exclusion.
  • The decision as to which process is allowed access to the CS next is arrived at by message passing, in which each process learns about the state of all other processes in some consistent way.

5 of 119

Distributed mutual exclusion algorithms

  • The design of distributed mutual exclusion algorithms is complex because these algorithms have to deal with unpredictable message delays and incomplete knowledge of the system state.
  • There are three basic approaches for implementing distributed mutual exclusion:

1. Token-based approach.

2. Non-token-based approach.

3. Quorum-based approach. => (Quorum: Subset of sites)

6 of 119

1. Token-based approach.

  • In the token-based approach, a unique token (also known as the PRIVILEGE message) is shared among the sites.
  • A site is allowed to enter its CS if it possesses the token and it continues to hold the token until the execution of the CS is over.
  • Mutual exclusion is ensured because the token is unique.
  • The algorithms based on this approach essentially differ in the way a site carries out the search for the token.


8 of 119

2. Non-token-based approach.

  • In the non-token-based approach, two or more successive rounds of messages are exchanged among the sites to determine which site will enter the CS next.
  • A site enters the critical section (CS) when an assertion, defined on its local variables, becomes true.
  • Mutual exclusion is enforced because the assertion becomes true at only one site at any given time.


10 of 119

3. Quorum-based approach

  • In the quorum-based approach, each site requests permission to execute the CS from a subset of sites (called a quorum).
  • The quorums are formed in such a way that when two sites concurrently request access to the CS, at least one site receives both the requests and this site is responsible to make sure that only one request executes the CS at any time.
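As an illustrative sketch (not from the slides), the following Python snippet shows why majority quorums of size ⌊n/2⌋ + 1 have the intersection property described above: any two quorums share at least one site, so some site receives both of two concurrent requests and can arbitrate between them. The function name and site labels are hypothetical.

```python
from itertools import combinations

def majority_quorums(sites):
    """All quorums of size floor(n/2) + 1 over the given sites."""
    k = len(sites) // 2 + 1
    return [set(q) for q in combinations(sites, k)]

sites = ["S1", "S2", "S3", "S4", "S5"]
quorums = majority_quorums(sites)

# Any two majority quorums share at least one site, so at least one
# site sees both of two concurrent CS requests and can arbitrate.
assert all(q1 & q2 for q1 in quorums for q2 in quorums)
```

Other quorum constructions (for example Maekawa's grid quorums) achieve the same pairwise-intersection property with smaller quorums.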


12 of 119

Requirements of mutual-exclusion algorithms

A mutual exclusion algorithm should satisfy the following properties:

1. Safety property

The safety property states that at any instant, only one process can execute the critical section. This is an essential property of a mutual exclusion algorithm.

2. Liveness property

This property states the absence of deadlock and starvation.

Two or more sites should not endlessly wait for messages that will never arrive.

In addition, a site must not wait indefinitely to execute the CS while other sites are repeatedly executing the CS.

That is, every requesting site should get an opportunity to execute the CS in finite time.

3. Fairness

Fairness in the context of mutual exclusion means that each process gets a fair chance to execute the CS.

In mutual exclusion algorithms, the fairness property generally means that the CS execution requests are executed in order of their arrival in the system (the time is determined by a logical clock).

The first property is absolutely necessary and the other two properties are considered important in mutual exclusion algorithms.

13 of 119

Performance metrics

The performance of mutual exclusion algorithms is generally measured by the following four metrics:

• Message complexity

This is the number of messages that are required per CS execution by a site.

• Synchronization delay

This is the time required after a site leaves the CS and before the next site enters the CS.

• Response time

This is the time interval a request waits for its CS execution to be over after its request messages have been sent out

• System throughput

This is the rate at which the system executes requests for the CS. If SD is the synchronization delay and E is the average critical-section execution time, then the throughput is given by the following equation:

system throughput = 1 / (SD + E)
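As a quick numeric sanity check on the throughput relation (a hypothetical example, not from the slides): one CS execution completes every SD + E time units, so the rate is the reciprocal of that period.

```python
def throughput(sd, e):
    """System throughput: one CS execution completes every SD + E time units."""
    return 1.0 / (sd + e)

# With a synchronization delay of 2 ms and an average CS execution
# time of 8 ms, the system completes 100 CS executions per second.
assert abs(throughput(0.002, 0.008) - 100.0) < 1e-9
```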

 

14 of 119

Low and high load performance

  • The load is determined by the arrival rate of CS execution requests.
  • Performance of a mutual exclusion algorithm depends upon the load and we often study the performance of mutual exclusion algorithms under two special loading conditions, viz., “low load” and “high load.”
  • Under low load conditions, there is seldom more than one request for the critical section present in the system simultaneously.
  • Under heavy load conditions, there is always a pending request for critical section at a site.
  • Thus, in heavy load conditions, after having executed a request, a site immediately initiates activities to execute its next CS request.
  • A site is seldom in the idle state under heavy load conditions. For many mutual exclusion algorithms, the performance metrics can be computed easily under low and heavy loads through simple mathematical reasoning.

15 of 119

Best and worst case performance

  • Generally, mutual exclusion algorithms have best and worst cases for the performance metrics.
  • In the best case, prevailing conditions are such that a performance metric attains the best possible value.
  • For example, in most mutual exclusion algorithms the best value of the response time is a roundtrip message delay plus the CS execution time, 2T + E.
  • Often for mutual exclusion algorithms, the best and worst cases coincide with low and high loads, respectively.
  • For example, the best and worst values of the response time are achieved when the load is low and high, respectively; in some mutual exclusion algorithms, the best and worst message traffic is generated at low and heavy load conditions, respectively.

16 of 119

Lamport’s algorithm

  • Lamport developed a distributed mutual exclusion algorithm as an illustration of his clock synchronization scheme.
  • The algorithm is fair in the sense that requests for the CS are executed in the order of their timestamps, where time is determined by logical clocks.
  • When a site processes a request for the CS, it updates its local clock and assigns the request a timestamp. The algorithm executes CS requests in increasing order of timestamps.
  • Every site Si keeps a queue, request_queuei, which contains mutual exclusion requests ordered by their timestamps. (Note that this queue is different from the queue that contains local requests for CS execution awaiting their turn.) This algorithm requires communication channels to deliver messages in FIFO order.
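The data structures above can be sketched in Python. This is a minimal single-process simulation, not a networked implementation: message delivery is simulated by direct method calls (assumed instantaneous and FIFO), the Site class and its method names are illustrative, and collecting a REPLY from every peer stands in for Lamport's condition of having received a later-timestamped message from every other site. Ties in the queue are broken by site id, since (timestamp, id) tuples compare lexicographically.

```python
import heapq

class Site:
    """One site in a simplified sketch of Lamport's algorithm."""
    def __init__(self, sid):
        self.sid = sid
        self.clock = 0
        self.peers = []        # other Site objects, filled in below
        self.queue = []        # request_queue: heap of (timestamp, site id)
        self.replies = set()

    def tick(self, other=0):
        # Lamport clock rule: advance past any received timestamp
        self.clock = max(self.clock, other) + 1
        return self.clock

    def request(self):
        ts = self.tick()
        heapq.heappush(self.queue, (ts, self.sid))
        self.replies = set()
        for p in self.peers:
            p.on_request(ts, self.sid)

    def on_request(self, ts, sender):
        self.tick(ts)
        heapq.heappush(self.queue, (ts, sender))
        requester = next(p for p in self.peers if p.sid == sender)
        requester.on_reply(self.tick(), self.sid)   # timestamped REPLY

    def on_reply(self, ts, sender):
        self.tick(ts)
        self.replies.add(sender)

    def can_enter(self):
        # own request heads the queue and a REPLY arrived from every peer
        return (bool(self.queue) and self.queue[0][1] == self.sid
                and len(self.replies) == len(self.peers))

    def release(self):
        heapq.heappop(self.queue)                   # drop own request
        for p in self.peers:
            p.on_release(self.sid)

    def on_release(self, sender):
        self.queue = [e for e in self.queue if e[1] != sender]
        heapq.heapify(self.queue)

a, b, c = Site("A"), Site("B"), Site("C")
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]

a.request()            # A's request gets the earlier timestamp
b.request()
assert a.can_enter() and not b.can_enter()
a.release()            # A leaves the CS and broadcasts RELEASE
assert b.can_enter()
```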

17 of 119

Lamport’s algorithm: Working

  • Requesting the critical section: when a site Si wants to enter the CS, it broadcasts a timestamped REQUEST message to all other sites and places the request in request_queuei. When a site Sj receives the REQUEST, it places the request in request_queuej and returns a timestamped REPLY to Si.
  • Executing the critical section: site Si enters the CS when (i) it has received a message with a timestamp larger than the timestamp of its own request from every other site, and (ii) its own request is at the top of request_queuei.
  • Releasing the critical section: on exiting the CS, Si removes its request from its own queue and broadcasts a timestamped RELEASE message to all other sites; on receiving the RELEASE, each site Sj removes Si’s request from request_queuej.

22 of 119

Lamport’s algorithm: Performance

  • For each CS execution, Lamport’s algorithm requires N − 1 REQUEST messages, N − 1 REPLY messages, and N − 1 RELEASE messages. Thus, Lamport’s algorithm requires 3(N − 1) messages per CS invocation. The synchronization delay in the algorithm is T.

23 of 119

Lamport’s algorithm: Example

  • Sites P3 and P1 request the CS concurrently with timestamps Ts1 and Ts2 respectively, where Ts1 < Ts2. Each broadcasts its REQUEST, Req(Ts1, P3) and Req(Ts2, P1), to all other sites.
  • Every site inserts both requests into its request queue in timestamp order, giving Queue [ (Ts1, P3), (Ts2, P1) ], and returns a REPLY for each REQUEST it receives.
  • Since (Ts1, P3) is at the head of every queue and P3 has received a REPLY from every other site, P3 enters the CS first.
  • On exiting, P3 broadcasts Release(Ts1, P3); each site removes (Ts1, P3) from its queue, leaving Queue [ (Ts2, P1) ] at the head, and P1 then enters the CS.

32 of 119

Lamport’s algorithm: An optimization

  • In Lamport’s algorithm, REPLY messages can be omitted in certain situations.
  • For example, if site Sj receives a REQUEST message from site Si after it has sent its own REQUEST message with a timestamp higher than the timestamp of site Si’s request, then site Sj need not send a REPLY message to site Si.
  • This is because when site Si receives site Sj’s request with a timestamp higher than its own, it can conclude that site Sj has no smaller-timestamp request still pending (because the communication channels preserve FIFO ordering).
  • With this optimization, Lamport’s algorithm requires between 2(N − 1) and 3(N − 1) messages per CS execution.

33 of 119

The Ricart–Agrawala Algorithm

  • Assumes that the communication channels are FIFO.
  • The algorithm uses two types of messages: REQUEST and REPLY.
  • A process sends a REQUEST message to all other processes to request their permission to enter the critical section.
  • A process sends a REPLY message to a process to give its permission to that process.
  • Processes use Lamport-style logical clocks to assign a timestamp to critical section requests.

34 of 119

The Ricart–Agrawala Algorithm

  • Timestamps are used to decide the priority of requests in case of conflict: if a process pi that is waiting to execute the critical section receives a REQUEST message from process pj, and pj’s request has lower priority, then pi defers the REPLY to pj and sends it only after executing the CS for its own pending request.
  • Otherwise, pi sends a REPLY message to pj immediately, provided it is currently not executing the CS.
  • Thus, if several processes are requesting execution of the CS, the highest priority request succeeds in collecting all the needed REPLY messages and gets to execute the CS.
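The reply-or-defer rule just described can be written as a small decision function. This is an illustrative sketch; the state names and the function signature are hypothetical, not part of the original algorithm text. Ties are broken by process id, so (timestamp, id) pairs give a total priority order.

```python
def on_request(my_state, my_ts, my_id, req_ts, req_id):
    """Decide whether to REPLY now or defer, per the Ricart-Agrawala rule.

    my_state is 'idle', 'requesting', or 'in_cs'. Requests compare as
    (timestamp, id) tuples: lexicographically smaller means higher priority.
    """
    if my_state == "in_cs":
        return "defer"                  # reply only after leaving the CS
    if my_state == "requesting" and (my_ts, my_id) < (req_ts, req_id):
        return "defer"                  # my own pending request has priority
    return "reply"

assert on_request("idle", None, 2, 5, 1) == "reply"
assert on_request("requesting", 3, 2, 5, 1) == "defer"   # my request is older
assert on_request("requesting", 5, 2, 3, 1) == "reply"   # their request is older
assert on_request("in_cs", 3, 2, 5, 1) == "defer"
```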

36 of 119

The Ricart–Agrawala Algorithm

Requesting the critical section

    • (a) When a site Si wants to enter the CS, it broadcasts a timestamped REQUEST message to all other sites.
    • (b) When site Sj receives a REQUEST message from site Si, it sends a REPLY message to site Si if site Sj is neither requesting nor executing the CS, or if site Sj is requesting and the timestamp of Si’s request is smaller than the timestamp of site Sj’s own request. Otherwise, the reply is deferred and Sj sets RDji = 1.

Executing the critical section

    • (c) Site Si enters the CS after it has received a REPLY message from every site it sent a REQUEST message to.

Releasing the critical section

    • (d) When site Si exits the CS, it sends all the deferred REPLY messages: ∀j, if RDij = 1, then Si sends a REPLY message to Sj and sets RDij = 0.

39 of 119

The Ricart–Agrawala Algorithm: Example

  • Sites P3 and P1 request the CS concurrently with timestamps Ts1 and Ts2, where Ts1 < Ts2, and both broadcast REQUEST messages.
  • P3’s request has the smaller timestamp and hence higher priority, so P1 sends its REPLY to P3 immediately, while P3 defers its reply to P1, recording (Ts2, P1) in its deferred request queue.
  • Having received a REPLY from every other site, P3 executes the CS.
  • On exiting, P3 sends the deferred REPLY to P1; P1 then has all the replies it needs and executes the CS.

40 of 119

Token-based algorithms

  • In token-based algorithms, a unique token is shared among the sites.
  • A site is allowed to enter its CS if it possesses the token.
  • A site holding the token can enter its CS repeatedly until it sends the token to some other site.
  • Depending upon the way a site carries out the search for the token, there are numerous token-based algorithms.

41 of 119

Token-based algorithms

  • Next, we discuss two token-based mutual exclusion algorithms.
  • Before we start with the discussion of token-based algorithms, two comments are in order.
  • First, token-based algorithms use sequence numbers instead of timestamps.
  • Every request for the token contains a sequence number and the sequence numbers of sites advance independently.
  • A site increments its sequence number counter every time it makes a request for the token. (A primary function of the sequence numbers is to distinguish between old and current requests.)
  • Second, the proof that token-based algorithms enforce mutual exclusion is trivial, because mutual exclusion is guaranteed as long as only the site holding the token can execute the CS.
  • Instead, the issues of freedom from starvation, freedom from deadlock, and detection of the token loss and its regeneration become more prominent.

42 of 119

Suzuki–Kasami’s broadcast algorithm

  • In Suzuki–Kasami’s algorithm, if a site that wants to enter the CS does not have the token, it broadcasts a REQUEST message for the token to all other sites.
  • A site that possesses the token sends it to the requesting site upon the receipt of its REQUEST message.
  • If a site receives a REQUEST message when it is executing the CS, it sends the token only after it has completed the execution of the CS.
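A minimal Python sketch of the algorithm's bookkeeping, under simplifying assumptions: message passing is simulated by a list of pending broadcasts, the token carries a record of last-served sequence numbers (the LN array of the standard description) together with a FIFO queue of waiting sites, and the function names are illustrative. A site j has an outstanding request exactly when the token holder has seen RN[j] = LN[j] + 1.

```python
class Token:
    """The token: LN holds the last-served sequence number per site."""
    def __init__(self, sites):
        self.LN = {s: 0 for s in sites}
        self.queue = []                     # FIFO queue of waiting sites

def request(site, RN, all_sites, outbox):
    """`site` broadcasts a REQUEST carrying its next sequence number."""
    RN[site][site] += 1
    for other in all_sites:
        if other != site:
            outbox.append((site, RN[site][site], other))

def on_request(receiver, sender, seq, RN):
    """Record the highest sequence number seen from `sender`."""
    RN[receiver][sender] = max(RN[receiver][sender], seq)

def release(holder, token, RN):
    """On leaving the CS, mark the holder's request served, queue every
    site with an outstanding request (RN[j] == LN[j] + 1), and hand the
    token to the head of the queue (or keep it if the queue is empty)."""
    token.LN[holder] = RN[holder][holder]
    for j, n in RN[holder].items():
        if j != holder and j not in token.queue and n == token.LN[j] + 1:
            token.queue.append(j)
    return token.queue.pop(0) if token.queue else holder

sites = ["A", "B", "C", "D"]
RN = {s: {t: 0 for t in sites} for s in sites}
token = Token(sites)

outbox = []
request("A", RN, sites, outbox)             # A broadcasts (A, 1)
for sender, seq, receiver in outbox:
    on_request(receiver, sender, seq, RN)
assert RN["D"]["A"] == 1                    # D has recorded A's request

next_holder = release("D", token, RN)       # D finishes its CS
assert next_holder == "A"                   # token travels to A
```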

43 of 119

Suzuki–Kasami’s broadcast algorithm

  • Any site that holds the token can enter the critical section (CS).
  • Suppose site A holds the token; it can then execute the CS.
  • When A is done, the token can be sent to another site, for example site B.
  • Upon receiving the token, site B can execute the CS, and similar handoffs can occur at sites C and D.
  • Suzuki–Kasami’s broadcast algorithm arranges this sharing of the token on a request basis.

51 of 119

Suzuki–Kasami’s broadcast algorithm: Example

  • Each of the sites A, B, C, and D maintains a request number counter array (RNA, RNB, RNC, RND) with one entry per site, all initialized to 0, and the token carries a request queue.
  • When site A wants to enter the CS, it increments its own entry, RNA[A] = 1, and broadcasts the request (A, 1), a (Process, RequestSequenceNo) pair, to sites B, C, and D.
  • On receiving (A, 1), each site compares the sequence number with its recorded value; since RNB[A], RNC[A], and RND[A] are all less than 1, every site updates its entry for A to 1 and recognizes the request as current, so all nodes approve.
  • If site D holds the token and is executing the CS, it places A’s request in the token’s queue.
  • After D’s CS execution is over, D records that A’s request (A, 1) has been satisfied, removes it from the queue, and transmits the token to site A, which can then execute the CS.

75 of 119

Deadlocks

  • Deadlocks are a fundamental problem in distributed systems and deadlock detection in distributed systems has received considerable attention in the past.
  • In distributed systems, a process may request resources in any order, which may not be known a priori, and a process can request a resource while holding others.
  • If the allocation sequence of process resources is not controlled in such environments, deadlocks can occur.
  • A deadlock can be defined as a condition where a set of processes request resources that are held by other processes in the set.

76 of 119

Dealing with Deadlocks

  • Deadlocks can be dealt with using any one of the following three strategies:
    • deadlock prevention,
    • deadlock avoidance, and
    • deadlock detection.
  • Deadlock prevention is commonly achieved by either having a process acquire all the needed resources simultaneously before it begins execution or by pre-empting a process that holds the needed resource.
  • In the deadlock avoidance approach to distributed systems, a resource is granted to a process only if the resulting global system state is safe.
  • Deadlock detection requires an examination of the status of the process–resources interaction for the presence of a deadlock condition.
  • To resolve the deadlock, we have to abort a deadlocked process.

77 of 119

System Model

  • A distributed system consists of a set of processors that are connected by a communication network.
  • The communication delay is finite but unpredictable. A distributed program is composed of a set of n asynchronous processes P1, P2, …, Pi, …, Pn that communicate by message passing over the communication network.
  • Without loss of generality we assume that each process is running on a different processor.
  • The processors do not share a common global memory and communicate solely by passing messages over the communication network.
  • There is no physical global clock in the system to which processes have instantaneous access.
  • The communication medium may deliver messages out of order, messages may be lost, garbled, or duplicated due to timeout and retransmission, processors may fail, and communication links may go down.
  • The system can be modeled as a directed graph in which vertices represent the processes and edges represent unidirectional communication channels.

78 of 119

System Model: Assumptions

We make the following assumptions:

• The systems have only reusable resources.

• Processes are allowed to make only exclusive access to resources.

• There is only one copy of each resource.

A process can be in two states, running or blocked. In the running state (also called active state), a process has all the needed resources and is either executing or is ready for execution. In the blocked state, a process is waiting to acquire some resource.

79 of 119

Wait-for graph (WFG)

  • In distributed systems, the state of the system can be modeled by a directed graph, called a wait-for graph (WFG). In a WFG, nodes are processes and there is a directed edge from node P1 to node P2 if P1 is blocked and is waiting for P2 to release some resource. A system is deadlocked if and only if there exists a directed cycle or knot in the WFG.
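The cycle condition can be checked with a standard depth-first search. The sketch below is illustrative Python (the process names are hypothetical), with the WFG given as a mapping from each process to the set of processes it waits for; a directed cycle signals deadlock under the single-resource and AND models. Knot detection, needed for the OR model, is not shown.

```python
def has_cycle(wfg):
    """Detect a directed cycle in a wait-for graph given as
    {process: set of processes it waits for}."""
    WHITE, GRAY, BLACK = 0, 1, 2            # unvisited / on stack / done
    color = {p: WHITE for p in wfg}

    def visit(p):
        color[p] = GRAY
        for q in wfg.get(p, ()):
            if color.get(q, WHITE) == GRAY:
                return True                 # back edge: cycle found
            if color.get(q, WHITE) == WHITE and visit(q):
                return True
        color[p] = BLACK
        return False

    return any(color[p] == WHITE and visit(p) for p in list(wfg))

# P1 waits for P2, P2 for P3, P3 for P1: a cycle, hence deadlock
assert has_cycle({"P1": {"P2"}, "P2": {"P3"}, "P3": {"P1"}})
# P1 -> P2 -> P3 with no edge back: no cycle, no deadlock
assert not has_cycle({"P1": {"P2"}, "P2": {"P3"}, "P3": set()})
```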

80 of 119

Wait-for graph (WFG): Example

81 of 119

Models for deadlock

  • Distributed systems allow many kinds of resource requests. A process might require a single resource or a combination of resources for its execution.
  • This section introduces a hierarchy of request models, starting with very restricted forms and ending with ones that impose no restrictions whatsoever.
  • This hierarchy shall be used to classify deadlock detection algorithms based on the complexity of the resource requests they permit.

82 of 119

The single-resource model

  • The single-resource model is the simplest resource model in a distributed system, where a process can have at most one outstanding request for only one unit of a resource.
  • Since the maximum out-degree of a node in a WFG for the single-resource model is 1, the presence of a cycle in the WFG indicates a deadlock.
  • In a later section, an algorithm to detect deadlock in the single-resource model is presented.

83 of 119

AND Model

  • In the AND model, a process can request more than one resource simultaneously and the request is satisfied only after all the requested resources are granted to the process.
  • The requested resources may exist at different locations.
  • The out-degree of a node in the WFG for the AND model can be more than 1.
  • The presence of a cycle in the WFG indicates a deadlock in the AND model. Each node of the WFG in such a model is called an AND node.

84 of 119

OR model

  • In the OR model, a process can make a request for numerous resources simultaneously and the request is satisfied if any one of the requested resources is granted.
  • The requested resources may exist at different locations.
  • If all requests in the WFG are OR requests, then the nodes are called OR nodes.
  • Presence of a cycle in the WFG of an OR model does not imply a deadlock in the OR model.

85 of 119

AND – OR model

  • A generalization of the previous two models (the OR model and the AND model) is the AND-OR model. In the AND-OR model, a request may specify any combination of AND and OR in the resource request.

86 of 119

Chandy–Misra–Haas algorithm: AND model

  • Chandy–Misra–Haas’s distributed deadlock detection algorithm for the AND model is based on edge-chasing.
  • The algorithm uses a special message called a probe, which is a triplet (i, j, k), denoting that it belongs to a deadlock detection initiated for process Pi and is being sent by the home site of process Pj to the home site of process Pk.
  • A probe message travels along the edges of the global WFG, and a deadlock is detected when a probe message returns to the process that initiated it.
  • A process Pj is said to be dependent on another process Pk if there exists a sequence of processes Pj, Pi1, Pi2, …, Pim, Pk such that each process except Pk in the sequence is blocked, and each process except Pj holds a resource for which the previous process in the sequence is waiting. Process Pj is said to be locally dependent upon process Pk if Pj is dependent upon Pk and both processes are on the same site.

87 of 119

Chandy–Misra–Haas algorithm: AND model

  • Data structures: each process Pi maintains a boolean array dependenti, where dependenti[j] is true only if Pi knows that Pj is dependent on it. Initially, dependenti[j] is false for all i and j.
  • The algorithm (given in the accompanying figure) is executed to determine if a blocked process is deadlocked.
  • A probe message is continuously circulated along the edges of the global WFG, and a deadlock is detected when a probe message returns to its initiating process.
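The edge-chasing idea can be sketched as follows. This is an illustrative Python simplification, not the full algorithm: all sites are collapsed into one address space, `dependents` is a hypothetical map from each blocked process to the processes it waits on, and probes are propagated iteratively from a worklist instead of by real messages. A deadlock is declared when a probe initiated for a process finds its way back to that process.

```python
def detect_deadlock(initiator, dependents):
    """Edge-chasing sketch: propagate probe (i, j, k) along wait-for edges;
    deadlock is detected if a probe returns to its initiator."""
    seen = set()                             # (j, k) edges already probed
    probes = [(initiator, initiator, k)
              for k in dependents.get(initiator, ())]
    while probes:
        i, j, k = probes.pop()
        if k == i:
            return True                      # probe came back: deadlock
        if (j, k) in seen:
            continue                         # avoid re-sending on an edge
        seen.add((j, k))
        probes.extend((i, k, m) for m in dependents.get(k, ()))
    return False

# P1 -> P2 -> P3 -> P1: the probe returns to P1, so deadlock is detected
assert detect_deadlock("P1", {"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]})
# No edge back to P1: the probes die out and no deadlock is reported
assert not detect_deadlock("P1", {"P1": ["P2"], "P2": ["P3"]})
```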

88 of 119

Chandy–Misra–Haas’s OR model

  • We now discuss Chandy–Misra–Haas’s distributed deadlock detection algorithm for the OR model.
  • It is based on the approach of diffusion computation (see Algorithm 10.2).
  • A blocked process determines if it is deadlocked by initiating a diffusion computation.
  • Two types of messages are used in a diffusion computation: query(i, j, k) and reply(i, j, k).
  • These messages denote that they belong to a diffusion computation initiated by process Pi and are being sent from process Pj to process Pk.

89 of 119

Recovery from Distributed Deadlock

  • Once deadlock has been detected within a distributed system, there must be a way to recover from it (or why bother to detect it in the first place?).
  • We need to rollback or abort one or more processes.
  • (In a database system, a partial rollback may work.)
  • Hopefully, the same situation will not re-occur.

90 of 119

Possible methods for recovery:

  • Operator intervention
    At one time, this was a feasible alternative for uniprocessor systems. However, it has little value for today's distributed systems.
  • Termination of process(es)
    Some victim process (or set of processes) is chosen for termination from the cycle or knot of deadlocked processes. The victim is terminated, requiring a later restart. All the resources allocated to it are released, so that they may be reassigned to other deadlocked processes. With an appropriately chosen victim process, this should resolve the deadlock.
  • Rolling back process(es)
    To roll back a victim process, there must have been some previous checkpoint at which the state of the victim process was saved to stable storage. This requires extra overhead.

There must also be an assurance that the rolled back process is not holding the resources needed by the other deadlocked processes at that point. With an appropriately chosen victim process, needed resources will be released and assigned to the other deadlocked processes. This should resolve the deadlock.

91 of 119

Issues in recovery methods:

  • How do we choose a victim process?
    By eliminating at least one process and releasing its resources, other processes should become unblocked. Clearly, we would prefer to choose a victim process that
    • minimizes the cost of recovery and
    • prevents starvation of processes.
  • Issues concerning recovery cost include the following:
    • the total number of processes affected;
    • the priorities of the different processes involved;
    • the natures of the different processes;
    • the number of resources involved with each process;
    • the types of resources involved with each process;
    • the length of time a process has been running;
    • whether or not a process has been victimized before.
  • The last item is also related to process starvation.
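A hypothetical way to combine the cost factors above into a victim-selection heuristic; the weights and field names are purely illustrative, not a prescription:

```python
# Each deadlocked process is described by a dict of the cost factors
# listed above; the victim is the one with the lowest recovery cost.

def victim_cost(p):
    cost = 2.0 * p["num_resources"]      # resources to release and reassign
    cost += 1.5 * p["priority"]          # higher priority => costlier to kill
    cost += 0.1 * p["runtime_s"]         # lost work grows with running time
    if p["victim_before"]:
        cost += 100.0                    # large penalty to avoid starvation
    return cost

def choose_victim(deadlocked):
    return min(deadlocked, key=victim_cost)

cycle = [
    {"pid": 1, "num_resources": 3, "priority": 5, "runtime_s": 40, "victim_before": False},
    {"pid": 2, "num_resources": 1, "priority": 2, "runtime_s": 5,  "victim_before": False},
    {"pid": 3, "num_resources": 2, "priority": 1, "runtime_s": 10, "victim_before": True},
]
print(choose_victim(cycle)["pid"])       # → 2 (the cheapest victim)
```

The starvation penalty implements the last bullet: a process already victimized once becomes very expensive to pick again.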

92 of 119

What needs to be done?

  • Once the victim process has been chosen, all resources assigned to that process should be released and assigned to other deadlocked processes in order to break the deadlock.
  • Since this is a distributed system, there is also a need to update the information at all sites.
  • All information concerning the victim process should be cleared, so that all sites no longer treat the victim process as an active process.
  • What if the victim process performed some centralized function?
    That function may need to be moved to another process. Election algorithms (e.g., the Bully algorithm) should be able to help with this.

93 of 119

Consensus and agreement algorithms

94 of 119

Problem definition

  • Agreement among the processes in a distributed system is a fundamental requirement for a wide range of applications.
  • Many forms of coordination require the processes to exchange information to negotiate with one another and eventually reach a common understanding or agreement, before taking application-specific actions.
  • A classical example is that of the commit decision in database systems, wherein the processes collectively decide whether to commit or abort a transaction that they participate in.
  • In this chapter, we study the feasibility of designing algorithms to reach agreement under various system models and failure models, and, where possible, examine some representative algorithms to reach agreement.

95 of 119

Study of agreement algorithms: assumptions

Failure models

Synchronous/asynchronous communication

Network connectivity

Sender identification

Channel reliability

Authenticated vs. non-authenticated messages

Agreement variable

96 of 119

Failure models

  • Among the n processes in the system, at most f processes can be faulty.
  • A faulty process can behave in any manner allowed by the failure model assumed.
  • The various failure models include fail-stop, send omission, receive omission, and Byzantine failures.
  • Recall that in the fail-stop model, a process may crash in the middle of a step, which could be the execution of a local operation or processing of a message for a send or receive event.
  • In particular, it may send a message to only a subset of the destination set before crashing.
  • In the Byzantine failure model, a process may behave arbitrarily.
  • The choice of the failure model determines the feasibility and complexity of solving consensus.

98 of 119

Synchronous/asynchronous communication

  • If a failure-prone process chooses to send a message to process Pi but fails, then Pi cannot detect the non-arrival of the message in an asynchronous system because this scenario is indistinguishable from the scenario in which the message takes a very long time in transit.
  • We will see this argument again when we consider the impossibility of reaching agreement in asynchronous systems in any failure model.
  • In a synchronous system, however, the scenario in which a message has not been sent can be recognized by the intended recipient, at the end of the round.
  • The intended recipient can deal with the non-arrival of the expected message by assuming the arrival of a message containing some default data, and then proceeding with the next round of the algorithm.
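A minimal sketch of this end-of-round rule, assuming a helper that is handed whatever messages actually arrived during the round (all names are illustrative):

```python
# In a synchronous system, non-arrival is detectable at the end of the
# round, so a missing message can be replaced by a default value.

DEFAULT = 0

def end_of_round(expected_senders, received):
    """received maps sender -> value for messages that actually arrived."""
    round_view = {}
    for s in expected_senders:
        # a crashed or silent sender contributes the default instead of
        # stalling the recipient forever
        round_view[s] = received.get(s, DEFAULT)
    return round_view

view = end_of_round({"P1", "P2", "P3"}, {"P1": 5, "P3": 7})
# P2's message never arrived, so P2 is recorded with the default value
```

In an asynchronous system this substitution would be unsound, since a slow message is indistinguishable from a missing one.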

100 of 119

Network connectivity

The system has full logical connectivity, i.e., each process can communicate with any other by direct message passing.

102 of 119

Sender identification

  • A process that receives a message always knows the identity of the sender process.
  • This assumption is important: even though a Byzantine sender can place fictitious data in the message payload, the underlying network layer protocols can reveal the true identity of the sender process.
  • When multiple messages are expected from the same sender in a single round, we implicitly assume a scheduling algorithm that sends these messages in sub-rounds, so that each message sent within the round can be uniquely identified.

104 of 119

Channel reliability

  • The channels are reliable, and only the processes may fail (under one of various failure models).
  • This is a simplifying assumption in our study.
  • As we will see, even with this simplifying assumption, the agreement problem is either unsolvable, or solvable only in a complex manner.

106 of 119

Authenticated vs. non-authenticated messages

  • In our study, we will deal only with unauthenticated messages.
  • With unauthenticated messages, when a faulty process relays a message to other processes:

(i) it can forge the message and claim that it was received from another process, and

(ii) it can tamper with the contents of a received message before relaying it.

  • When a process receives a message, it has no way to verify its authenticity.
  • An unauthenticated message is also called an oral message or an unsigned message.
  • Using authentication via techniques such as digital signatures, it is easier to solve the agreement problem: if some process forges a message or tampers with the contents of a received message before relaying it, the recipient can detect the forgery or tampering.
  • Thus, faulty processes can inflict less damage.
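A small sketch of the detection benefit, using an HMAC as a stand-in for per-process digital signatures (the key and message contents are illustrative; real protocols use asymmetric signatures so that receivers cannot forge tags themselves):

```python
import hashlib
import hmac

# The source authenticates its message with a keyed digest; a relay
# that tampers with the payload invalidates the tag.
KEY = b"source-process-key"

def sign(payload: bytes) -> bytes:
    return hmac.new(KEY, payload, hashlib.sha256).digest()

def verify(payload: bytes, tag: bytes) -> bool:
    # constant-time comparison avoids leaking tag bytes via timing
    return hmac.compare_digest(sign(payload), tag)

msg = b"attack at 10am"
tag = sign(msg)
assert verify(msg, tag)                    # untampered relay passes
assert not verify(b"attack at noon", tag)  # tampered relay is detected
```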

108 of 119

Agreement variable

  • The agreement variable may be boolean or multivalued, and need not be an integer.
  • When studying some of the more complex algorithms, we will use a boolean variable.
  • This simplifying assumption does not affect the results for other data types, but helps in the abstraction while presenting the algorithms.

109 of 119

Byzantine Problem

  • Consider the difficulty of reaching agreement using the following example, which is inspired by the long wars fought by the Byzantine Empire in the Middle Ages.
  • Four camps of the attacking army, each commanded by a general, are camped around the fort of Byzantium.
  • They can succeed in attacking only if they attack simultaneously.
  • Hence, they need to reach agreement on the time of attack. The only way they can communicate is to send messengers among themselves.
  • The messengers model the messages.
  • An asynchronous system is modeled by messengers taking an unbounded time to travel between two camps.
  • A lost message is modeled by a messenger being captured by the enemy.
  • A Byzantine process is modeled by a general being a traitor.
  • The traitor will attempt to subvert the agreement-reaching mechanism, by giving misleading information to the other generals.
  • For example, a traitor may inform one general to attack at 10 a.m., and inform the other generals to attack at noon.
  • Or he may not send a message at all to some general. Likewise, he may tamper with the messages he gets from other generals, before relaying those messages.

110 of 119

Byzantine Problem Example

  • A simple example of Byzantine behavior is shown in Figure 14.1.
  • Four generals are shown, and a consensus decision is to be reached about a boolean value.
  • The various generals are conveying potentially misleading values of the decision variable to the other generals, which results in confusion.
  • In the face of such Byzantine behavior, the challenge is to determine whether it is possible to reach agreement, and if so under what conditions.
  • If agreement is reachable, then protocols to reach it need to be devised.

111 of 119

The Byzantine agreement problem

  • The Byzantine agreement problem requires a designated process, called the source process, with an initial value, to reach agreement with the other processes about its initial value, subject to the following conditions:
    • Agreement
      • All non-faulty processes must agree on the same value.
    • Validity
      • If the source process is non-faulty, then the agreed upon value by all the non-faulty processes must be the same as the initial value of the source.
    • Termination
      • Each non-faulty process must eventually decide on a value.

112 of 119

Validity

  • The validity condition rules out trivial solutions, such as one in which the agreed upon value is a constant.
  • It also ensures that the agreed upon value is correlated with the source value.
  • If the source process is faulty, then the correct processes can agree upon any value.
  • It is irrelevant what the faulty processes agree upon – or whether they terminate and agree upon anything at all.

113 of 119

Byzantine

  • There are two other popular flavors of the Byzantine agreement problem:
    • the consensus problem, and
    • the interactive consistency problem.

114 of 119

The consensus problem

  • The consensus problem differs from the Byzantine agreement problem in that each process has an initial value and all the correct processes must agree on a single value:
    • Agreement
      • All non-faulty processes must agree on the same (single) value.
    • Validity
      • If all the non-faulty processes have the same initial value, then the agreed upon value by all the non-faulty processes must be that same value.
    • Termination
      • Each non-faulty process must eventually decide on a value.

115 of 119

The interactive consistency problem

  • The interactive consistency problem differs from the Byzantine agreement problem in that each process has an initial value, and all the correct processes must agree upon a set of values, with one value for each process:
    • Agreement
      • All non-faulty processes must agree on the same array of values A[v1, …, vn].
    • Validity
      • If process i is non-faulty and its initial value is vi, then all non-faulty processes agree on vi as the ith element of the array A. If process j is faulty, then the non-faulty processes can agree on any value for A[j].
    • Termination
      • Each non-faulty process must eventually decide on the array A.
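The correctness conditions of the three variants can be summarized as hypothetical checker predicates over the decisions of the non-faulty processes (function names and calling conventions are assumptions for illustration):

```python
# Each predicate checks Agreement and Validity for one problem variant;
# Termination is implicit in the decisions being available at all.

def byzantine_agreement_ok(decisions, source_value, source_faulty):
    vals = set(decisions.values())
    if len(vals) != 1:
        return False                     # agreement violated
    # validity binds only when the source is non-faulty
    return source_faulty or vals == {source_value}

def consensus_ok(decisions, initial):
    vals = set(decisions.values())
    if len(vals) != 1:
        return False                     # agreement violated
    if len(set(initial.values())) == 1:  # all non-faulty inputs coincide
        return vals == set(initial.values())
    return True                          # otherwise any common value is valid

def interactive_consistency_ok(arrays, initial, faulty):
    first = next(iter(arrays.values()))
    if any(a != first for a in arrays.values()):
        return False                     # all must agree on the same array
    # entry i must equal Pi's input for every non-faulty Pi
    return all(first[i] == v for i, v in initial.items() if i not in faulty)
```

For example, two non-faulty processes both deciding 1 for a non-faulty source with input 1 satisfies Byzantine agreement, while a split decision fails the Agreement clause.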

116 of 119

[Figure residue from slides 116-119: sites Si, Sj, Sk, Sl, and Sm exchange timestamped messages (3:00 PM through 3:04 PM), illustrating how the sites converge on an agreed common value.]