3 of 27

Distributed Database Concepts

A transaction can be executed by multiple networked computers in a unified manner.
A distributed database (DDB) processes Unit of execution (a transaction) in a distributed manner. A distributed database (DDB) can be defined as

A distributed database (DDB) is a collection of multiple logically related database distributed over a computer network, and a distributed database management system as a software system that manages a distributed database while making the distribution transparent to the user.

4 of 27

Advantages

Management of distributed data with different levels of transparency:

This refers to the physical placement of data (files, relations, etc.) which is not known to the user (distribution transparency).

5 of 27

Advantages (transparency, contd.)

The EMPLOYEE, PROJECT, and WORKS_ON tables may be fragmented horizontally and stored with possible replication as shown below.

6 of 27

Advantages (transparency, contd.)

Distribution and Network transparency:

Users do not have to worry about operational details of the network.

There is Location transparency, which refers to freedom of issuing command from any location without affecting its working.
Then there is Naming transparency, which allows access to any names object (files, relations, etc.) from any location.

Replication transparency:

It allows to store copies of a data at multiple sites as shown in the above diagram.
This is done to minimize access time to the required data.

Fragmentation transparency:

Allows to fragment a relation horizontally (create a subset of tuples of a relation) or vertically (create a subset of columns of a relation).

7 of 27

Other Advantages

Increased reliability and availability:

Reliability refers to system live time, that is, system is running efficiently most of the time. Availability is the probability that the system is continuously available (usable or accessible) during a time interval.
A distributed database system has multiple nodes (computers) and if one fails then others are available to do the job.

Improved performance:

A distributed DBMS fragments the database to keep data closer to where it is needed most.
This reduces data management (access and modification) time significantly.

Easier expansion (scalability):

Allows new nodes (computers) to be added anytime without chaining the entire configuration.

8 of 27

Data Fragmentation

Split a relation into logically related and correct parts. A relation can be fragmented in two ways:

Horizontal Fragmentation
Vertical Fragmentation

Horizontal fragmentation

It is a horizontal subset of a relation which contain those of tuples which satisfy selection conditions.
Consider the Employee relation with selection condition (DNO = 5). All tuples satisfy this condition will create a subset which will be a horizontal fragment of Employee relation.
A selection condition may be composed of several conditions connected by AND or OR.
Derived horizontal fragmentation: It is the partitioning of a primary relation to other secondary relations which are related with Foreign keys.

9 of 27

Vertical fragmentation

It is a subset of a relation which is created by a subset of columns. Thus a vertical fragment of a relation will contain values of selected columns. There is no selection condition used in vertical fragmentation.
Consider the Employee relation. A vertical fragment of can be created by keeping the values of Name, Bdate, Sex, and Address.
Because there is no condition for creating a vertical fragment, each fragment must include the primary key attribute of the parent relation Employee. In this way all vertical fragments of a relation are connected.

10 of 27

Representation

Horizontal fragmentation

Each horizontal fragment on a relation can be specified by a σ_Ci (R) operation in the relational algebra.
Complete horizontal fragmentation
A set of horizontal fragments whose conditions C1, C2, …, Cn include all the tuples in R- that is, every tuple in R satisfies (C1 OR C2 OR … OR Cn).
Disjoint complete horizontal fragmentation: No tuple in R satisfies (Ci AND Cj) where i ≠ j.
To reconstruct R from horizontal fragments a UNION is applied.

11 of 27

Representation

Vertical fragmentation

A vertical fragment on a relation can be specified by a Π_Li(R) operation in the relational algebra.
Complete vertical fragmentation
A set of vertical fragments whose projection lists L1, L2, …, Ln include all the attributes in R but share only the primary key of R. In this case the projection lists satisfy the following two conditions:
L1 ∪ L2 ∪ ... ∪ Ln = ATTRS (R)
Li ∩ Lj = PK(R) for any i j, where ATTRS (R) is the set of attributes of R and PK(R) is the primary key of R.
To reconstruct R from complete vertical fragments a OUTER UNION is applied.

12 of 27

Representation

Mixed (Hybrid) fragmentation

A combination of Vertical fragmentation and Horizontal fragmentation.
This is achieved by SELECT-PROJECT operations which is represented by Π_Li(σ_Ci (R)).
If C = True (Select all tuples) and L ≠ ATTRS(R), we get a vertical fragment, and if C ≠ True and L ≠ ATTRS(R), we get a mixed fragment.
If C = True and L = ATTRS(R), then R can be considered a fragment.

13 of 27

Fragmentation schema

A definition of a set of fragments (horizontal or vertical or horizontal and vertical) that includes all attributes and tuples in the database that satisfies the condition that the whole database can be reconstructed from the fragments by applying some sequence of UNION (or OUTER JOIN) and UNION operations.

Allocation schema

It describes the distribution of fragments to sites of distributed databases. It can be fully or partially replicated or can be partitioned.

14 of 27

Data Replication

Database is replicated to all sites.
In full replication the entire database is replicated and in partial replication some selected part is replicated to some of the sites.
Data replication is achieved through a replication schema.

Data Distribution (Data Allocation)

This is relevant only in the case of partial replication or partition.
The selected portion of the database is distributed to the database sites.

15 of 27

Replication

Replication in a Distributed System refers to the process of storing copies of the same data on multiple servers or sites within the network.
The main goal of replication is to increase data availability, reliability, and performance — ensuring that even if one server fails, the data can still be accessed from another replica.
In simple terms, replication means creating and maintaining identical copies of data across different nodes in a distributed database.

16 of 27

Primary-Backup Replication

Primary-Backup Replication (also known as active-passive replication) involves designating one primary replica (active) to handle all updates (writes), while one or more backup replicas (passive) maintain copies of the data and synchronize with the primary.

17 of 27

Advantages

Strong Consistency: Since all updates go through the primary replica, read operations can be served with strong consistency guarantees.
Fault Tolerance: If the primary replica fails, one of the backup replicas can be promoted to become the new primary, ensuring continuous availability.

Disadvantages

Latency for Reads: Read operations might experience latency because they might need to wait for updates to propagate from the primary to the backup replicas.
Resource Utilization: Backup replicas are often idle unless a failover occurs, which can be seen as inefficient resource utilization.

18 of 27

Multi-Primary Replication

Multi-Primary Replication allows multiple replicas to accept updates independently. Each replica acts as both a client (accepting updates) and a server (propagating updates to other replicas).

19 of 27

Advantages

Increased Write Throughput: Multiple replicas can handle write requests concurrently, improving overall system throughput.
Lower Write Latency: Writes can be processed locally at each replica, reducing the latency compared to centralized primary-backup models.
Fault Tolerance: Even if one replica fails, other replicas can continue to accept writes and serve read operations.

Disadvantages

Conflict Resolution: Concurrent updates across multiple primaries can lead to conflicts that need to be resolved, typically using techniques like conflict detection and resolution algorithms (e.g., timestamp ordering or version vectors).
Consistency Management: Ensuring consistency across all replicas can be complex, especially in distributed environments with network partitions or communication delays.

20 of 27

Synchronous Replication

All replicas are updated simultaneously when a change occurs.
Guarantees strong consistency, but can be slower due to communication overhead.
Example: Banking transactions, where data accuracy is critical.

Asynchronous Replication

Updates are made to one replica first and then propagated to others later.
Provides better performance, but there may be a temporary data inconsistency.
Example: Social media apps where slight delays are acceptable.

Semi-Synchronous Replication

A hybrid approach where some replicas are updated immediately, while others are updated later.

21 of 27

Advantages of Replication

High Availability:

If one node fails, data can still be accessed from another replica.

Improved Performance:

Users can access data from the nearest replica, reducing latency and response time.

Fault Tolerance:

System continues to operate even if some nodes or links fail.

Load Balancing:

Read requests can be distributed across replicas, reducing the load on any single server.

Disaster Recovery:

Replicas can be used to restore data after a failure or disaster.

22 of 27

Disadvantages of Replication

Data Inconsistency:

If updates are not synchronized properly, replicas may contain different versions of the same data.

Increased Storage Cost:

More copies of data require more storage space.

Complex Maintenance:

Managing and synchronizing multiple replicas adds system complexity.

Network Overhead:

Frequent updates between replicas can increase network traffic.

23 of 27

Fragmentation in Distributed Database

Fragmentation in a Distributed Database System (DDBS) is the process of dividing a large database into smaller, more manageable pieces, called fragments, which are stored at different sites in a distributed network.

Each fragment is a subset of the entire database, and together, all fragments form the complete database.�The goal of fragmentation is to improve performance, reliability, and manageability by keeping data closer to where it is most frequently accessed.

24 of 27

Key Objectives of Fragmentation

Improved Performance:

Queries can be executed locally, reducing communication delays.

Increased Availability:

Even if one site fails, other sites still contain useful fragments of data.

Security and Control:

Sensitive data can be stored only at secure or authorized sites.

Parallelism:

Different fragments can be processed simultaneously for faster query execution.

Reduced Data Transfer:

Users access only relevant fragments instead of the entire database.

25 of 27

Types of Fragmentation

1. Horizontal Fragmentation

In Horizontal Fragmentation, the table is divided into rows (tuples) based on certain conditions.
Each fragment contains a subset of rows that satisfy a specific WHERE condition, but all fragments have the same columns.

26 of 27

Example

Fragments:

Fragment 1 (Delhi Employees): Rows where City = 'Delhi'
Fragment 2 (Other Employees): Rows where City != 'Delhi'

Emp_ID	Name	Dept	City
101	Raj	HR	Delhi
102	Neha	IT	Mumbai
103	Aman	HR	Chennai
104	Ravi	IT	Delhi

27 of 27

2. Vertical Fragmentation

In Vertical Fragmentation, the table is divided into columns (attributes).
Each fragment contains a subset of columns, and all fragments share a common p

Example:

Same EMPLOYEE table divided as:
Fragment 1: (Emp_ID, Name, Dept)
Fragment 2: (Emp_ID, City)
primary key to allow reconstruction of the original table.