3 of 25

Definition

Synchronization is the coordination of events to operate a system in unison (same time or rate)
Systems that operate with all parts in synchrony are said to be synchronous or in sync—and those that are not are asynchronous.
Achieved via clocks –processes/Resource, Concurrency, Multiple nodes

4 of 25

Time and Global State Introduction

Physical time cannot be perfectly synchronized in a distributed system it is not possible to gather the global state of the system at a particular time.
The global state of a distributed system is the set of local states of each individual processes involved in the system plus the state of the communication channels

5 of 25

� CLOCK SYNCHRONIZATION

Every computer needs a timer mechanism (called a computer clock) to keep track of current time and also For various accounting purposes such as calculating the time spent by a process in CPU utilization, disk I/O and so on, so that the corresponding user can be charged properly.
Accounting-resource – CPU, memory, Hard Disk

Clocks – crystal oscillate at the same freq

6 of 25

…

It is the job of a distributed operating system designer to devise and use suitable algorithms for properly synchronizing the clocks of a distributed system

7 of 25

MUTUAL EXCLUSION

Several resources in a system that must not be used simultaneously by multiple processes if program operation is to be correct.

File must not be simultaneously updated by multiple processes-DBMS
Printers must be restricted to a single process at a time. –spooling-Queue
Processes(A,B) -🡪Printer

Therefore, exclusive access to such a shared resource by a process must be ensured.

Mutual exclusion - are introduced to prevent processes from executing concurrently within their associated critical sections/resources.

8 of 25

…

An algorithm for implementing mutual exclusion must satisfy the following requirements:

Issues in Recovery from Deadlock:

Selection of Victim(s): -min recovery cost, prevention of starvation
Use of Transaction Mechanism:

9 of 25

i) Selection of Victims

Based on two major factors

1. Minimization of recovery cost - those processes should be selected as victims whose termination / rollback will incur the minimum recovery cost.

Some of the factors that may be considered for this purpose are

a) The priority of the processes; (BANK- P1-opening an account, P2-checking balance, P3-depositing)

b) The nature of the processes, such as interactive or batch and possibility of return with no ill effects;

c) The number and types of resources held by the processes;

d) The length of service already received and the expected length of service further needed by the processes; and

e) The total number of processes that will be affected.

2. Prevention of starvation:

If a system only aims at minimization of recovery cost, it may happen that the same process (probably because its priority is very low) is repeatedly selected as a victim and may never complete.

10 of 25

ii)Use of Transaction Mechanism:

The use of transaction mechanism - ensures all or no effect -DBMS

11 of 25

ELECTION ALGORITHMS

Assumptions

1. Each process in the system has a unique priority number.

2. Whenever an election is held, the process having the highest priority number among the currently active processes is elected as the coordinator.

3. On recovery, a failed process can take appropriate actions to rejoin the set of active processes.

12 of 25

1. Bully Algorithm:

1) Process P sends an ELECTION message to all the processes with higher numbers.

2) If no one responds, P wins the election and becomes the coordinator.

3) If one higher process answers; it takes over the job and P’s job is done.

13 of 25

The Bully election algorithm �

a) Process 4 holds an election.

b) Process 5 and 6 respond, telling 4 to stop.

c) Now 5 and 6 each hold an election.

d) Process 6 tells 5 to stop.

e) Process 6 wins and tells everyone.

14 of 25

ii) Ring Algorithm:

Two process, Number 2 and Number 5 discover together that the previous coordinator; Number 7 has crashed.
Number 2 and Number 5 will each build an election message and start circulating it along the ring.
Both the messages in the end will go to Number 2 and Number 5 and they will convert the message into the coordinator with exactly the same number of members and in the same order. When both such messages have gone around the ring, they both will be discarded and the process of election will re-start.

15 of 25

Centralized Algorithm

Process 1 asks the coordinator for permission to enter a critical region. Permission is granted.

b) Process 2 then asks permission to enter the same critical region. The coordinator does not reply.

c) When process 1 exits the critical region, it tells the coordinator, when then replies to process 2.

16 of 25

Mutual Exclusion Algorithms

Other Algorithms for Mutual exclusion

Distributed Algorithm
Token Ring Algorithm

17 of 25

Distributed Debugging

Distributed systems involve multiple computers working together to achieve a common goal. Debugging these systems can be challenging due to their complexity and the need for coordination between different parts. The Debugging Techniques in Distributed Systems explore various methods to identify and fix errors in such environments. It covers techniques like logging, tracing, and monitoring, which help track system behavior and locate issues.
Issues such as synchronization errors, concurrency bugs, and network failures are common challenges in distributed systems. Debugging aims to ensure that all parts of the system work correctly and efficiently together, maintaining overall system reliability and performance.

18 of 25

Common failure in Distributed Systems

Network Issues: Problems such as latency, packet loss, jitter, and disconnections can disrupt communication between nodes, causing data inconsistency and system downtime.
Concurrency Problems: Simultaneous operations on shared resources can lead to race conditions, deadlocks, and livelocks, which are difficult to detect and resolve.
Data Consistency Errors: Ensuring data consistency across multiple nodes can be challenging, leading to replication errors, stale data, and partition tolerance issues.
Faulty Hardware: Failures in physical components like servers, storage devices, and network infrastructure can introduce errors that are difficult to trace back to their source.
Software Bugs: Logical errors, memory leaks, improper error handling, and bugs in the code can cause unpredictable behavior and system crashes.
Configuration Mistakes: Misconfigured settings across different nodes can lead to inconsistencies, miscommunications, and failures in the system’s operation.

19 of 25

Logging

What is Logging?
Logging involves capturing detailed records of events, actions, and state changes within the system. Key aspects include:
Centralized Logging: Collect logs from all nodes in a centralized location to facilitate easier analysis and correlation of events across the system.
Log Levels: Use different log levels (e.g., DEBUG, INFO, WARN, ERROR) to control the verbosity of log messages, allowing for fine-grained control over the information captured.
Structured Logging: Use structured formats (e.g., JSON) for log messages to enable better parsing and searching.
Contextual Information: Include contextual details like timestamps, request IDs, and node identifiers to provide a clear picture of where and when events occurred.
Error and Exception Logging: Capture stack traces and error messages to understand the root causes of failures.
Log Rotation and Retention: Implement log rotation and retention policies to manage log file sizes and storage requirements.

20 of 25

Monitoring

Metrics Collection: Collect various performance metrics (e.g., CPU usage, memory usage, disk I/O, network latency) from all nodes.
Health Checks: Implement regular health checks for all components to ensure they are functioning correctly.
Alerting: Set up alerts for critical metrics and events to notify administrators of potential issues in real-time.
Visualization: Use dashboards to visualize metrics and logs, making it easier to spot trends, patterns, and anomalies.
Tracing: Implement distributed tracing to follow the flow of requests across different services and nodes, helping to pinpoint where delays or errors occur.
Anomaly Detection: Use machine learning and statistical techniques to automatically detect unusual patterns or behaviors that may indicate underlying issues.

21 of 25

Distributed Tracing

Distributed Tracing extends traditional tracing to distributed systems, where requests may traverse multiple services, databases, and other components spread across different locations. Key aspects include:
Trace Propagation: Passing trace context (e.g., trace ID and span ID) along with requests to maintain continuity as they move through the system.
End-to-End Visibility: Capturing traces across all services and components to get a comprehensive view of the entire request lifecycle.
Latency Analysis: Measuring the time spent in each service or component to identify where delays or performance issues occur.
Error Diagnosis: Pinpointing where errors happen and understanding their impact on the overall request.

22 of 25

Multicast Communication

Multicast communication is a type of communication where a single message is sent to a group of processes. It uses a multicast address as a filter to determine which processes receive the message. This allows for efficient communication between multiple processes and can be implemented using a broadcast mechanism.

23 of 25

Importance Of Group Communication

Coordination and Synchronization:Distributed systems often involve multiple nodes or entities that need to collaborate and synchronize their activities.
Efficient Information Sharing:In distributed systems, different nodes may generate or process data that needs to be shared among multiple recipients.
Fault Tolerance and Reliability:Group communication protocols often include mechanisms for ensuring reliability and fault tolerance.
Scalability:Group communication mechanisms are designed to handle increasing numbers of nodes and messages without compromising performance or reliability.

24 of 25

Consensus in Distributed Systems

Consensus in a distributed system, then, is the notion that all of the nodes agree on a state variable's value.
Distributed consensus in distributed systems refers to the process by which multiple nodes or components in a network agree on a single value or a course of action despite potential failures or differences in their initial states or inputs. It is crucial for ensuring consistency and reliability in decentralized environments where nodes may operate independently and may experience delays or failures.

25 of 25

Importance of Distributed Consensus in Distributed Systems�

Consistency and Reliability:

Distributed consensus ensures that all nodes in a distributed system agree on a common state or decision. This consistency is crucial for maintaining data integrity and preventing conflicting updates.

Fault Tolerance:

Distributed consensus mechanisms enable systems to continue functioning correctly even if some nodes experience failures or network partitions. By agreeing on a consistent state, the system can recover and continue operations smoothly.

Decentralization:

In decentralized networks, where nodes may operate autonomously, distributed consensus allows for coordinated actions and ensures that decisions are made collectively rather than centrally. This is essential for scalability and resilience.

Concurrency Control:

Consensus protocols help manage concurrent access to shared resources or data across distributed nodes. By agreeing on the order of operations or transactions, consensus ensures that conflicts are avoided and data integrity is maintained.