Computer Organization and Architecture

Module V: Multiprocessors and Parallel Processing

Name: Prof. Reshma Kohad

Parallel Processing Fundamentals

What is parallel processing

Parallel processing is used to increase the computational speed of computer systems by performing multiple data-processing operations simultaneously.
For example, while an instruction is being executed in ALU, the next instruction can be read from memory.
The system can have two or more ALUs and be able to execute multiple instructions at the same time.
In addition, two or more processors can be used to speed up processing. The amount of hardware increases with parallel processing, and with it, the cost of the system increases.
However, technological developments have reduced hardware costs to the point where parallel processing techniques are economically feasible.
Parallel processing can be viewed at multiple levels of complexity. At the lowest level, parallel and serial operations are distinguished by the type of registers used.

The main advantage of parallel processing is that it provides better utilization of system resources by increasing resource multiplicity, which increases overall system throughput.

The need for parallel processing includes:

Speed and Efficiency: It enables faster execution of complex, large-scale, or repetitive computations by dividing them among multiple processors.

Handling Large Data (Big Data): It is critical for analyzing huge, diverse, and fast-growing datasets that a single processor cannot handle efficiently.

Real-Time Simulation & Modeling: Necessary for complex, time-sensitive simulations like climate modeling, financial modeling, and engineering simulations.

Scalability: Systems can handle larger, more complex tasks simply by adding more processors.

Improved Reliability & Fault Tolerance: If one processor fails, the entire system does not crash, as the workload can be distributed among remaining processors.

Application Areas: Crucial for advanced graphics rendering, scientific simulations, AI/Machine Learning, and data analytics.

Speedup and performance metrics

Amdahl’s Law, proposed by Gene Amdahl in 1967, explains the theoretical speedup of a program when part of it is improved or parallelized.
It is widely used in parallel computing to predict the benefits of using multiple processors.
The main idea is that the speedup of a system is limited by the portion of the program that cannot be parallelized (the sequential part).

Key Terms
Speedup (S): Performance improvement gained by the enhancement.
S = old execution time / new execution time
Fraction Enhanced (P): The proportion of the program that can be parallelized (0 < P < 1).

Number of Processors (N): The number of parallel units used for execution.

Formula

S = 1 / ((1 - P) + P/N)
(1 - P): sequential portion (cannot be parallelized).
P/N: parallel portion divided among N processors.

Maximum Speedup

If processors are unlimited (N → ∞):
Smax = 1 / (1 - P)
This means the non-parallelizable fraction sets the performance limit.
If P = 1 (100% parallelizable), theoretical speedup is infinite (not realistic).
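
The formula and its limit can be sketched in a few lines of Python. This is only an illustration: the function names amdahl_speedup and amdahl_limit and the value P = 0.9 are assumptions chosen for the example, not part of the original slides.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup S = 1 / ((1 - P) + P / N) for parallel fraction p on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


def amdahl_limit(p: float) -> float:
    """Upper bound Smax = 1 / (1 - P), reached only as N grows without bound."""
    if p >= 1.0:
        return float("inf")  # fully parallel program: no finite bound
    return 1.0 / (1.0 - p)


if __name__ == "__main__":
    p = 0.9  # illustrative parallel fraction (assumption)
    for n in (1, 2, 8, 64, 1024):
        print(f"N = {n:5d}  S = {amdahl_speedup(p, n):.2f}")
    print(f"limit      Smax = {amdahl_limit(p):.2f}")
```

Running the loop shows the speedup flattening out below Smax as N grows, which is the behaviour the sequential fraction (1 - P) imposes.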

Example
Suppose a program spends 20% of its time in parallelizable work (P = 0.2), and we use 5 processors (N = 5):

S = 1 / ((1 - P) + P/N)
S = 1 / ((1 - 0.2) + 0.2/5)
S = 1 / (0.8 + 0.04) = 1 / 0.84 ≈ 1.19
The system improves by only 19%, showing that the 80% sequential part is the bottleneck.

Advantages

Provides a clear upper bound on performance.
Helps identify bottlenecks in programs.
Useful in guiding hardware/software design decisions.

Disadvantages

Assumes the sequential part is fixed (in practice, it can sometimes be optimized).
Assumes processors are identical, which is not always true in heterogeneous systems.
Ignores real-world factors like communication, synchronization, and load balancing overhead.

Key Performance Metrics

Speedup (S): The ratio of the time taken to solve a problem on a single processor (Ts) to the time taken on p processors (Tp). Ideally, S = p (linear speedup), though it is often less due to overhead.

Formula: S = Ts/Tp

Efficiency (E): Measures the fraction of time the processors are usefully employed. It is the ratio of speedup to the number of processors.

Formula: E = S/p

Parallel Overhead (T0): The total time spent by all processing elements on tasks not present in the sequential execution, such as communication, synchronization, and idle time.

Formula: T0 = p*Tp - Ts
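
As a rough illustration, the three formulas above can be evaluated from two measured run times. The helper name parallel_metrics and the timings (100 s sequential, 30 s on 4 processors) are hypothetical values chosen for this sketch.

```python
def parallel_metrics(ts: float, tp: float, p: int) -> dict:
    """Compute speedup, efficiency, and total overhead.

    ts: run time on one processor
    tp: run time on p processors (same time unit as ts)
    """
    if ts <= 0 or tp <= 0 or p < 1:
        raise ValueError("ts and tp must be positive and p at least 1")
    speedup = ts / tp          # S = Ts / Tp
    efficiency = speedup / p   # E = S / p
    overhead = p * tp - ts     # T0 = p*Tp - Ts
    return {"speedup": speedup, "efficiency": efficiency, "overhead": overhead}


# Hypothetical measurement: 100 s sequentially, 30 s on 4 processors.
metrics = parallel_metrics(ts=100.0, tp=30.0, p=4)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these numbers the speedup is about 3.33 rather than the ideal 4, the efficiency is about 0.83, and the four processors together spend 20 s on work that the sequential run never needed.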

Scalability: A system's ability to maintain performance gains (efficiency) as the number of processors (p) and problem size (W) increase.
Cost (C): The product of parallel runtime (Tp) and the number of processors (p). A parallel system is cost-optimal if its cost is proportional to the sequential execution time.

Formula: C = p*Tp

Throughput: The total amount of work done in a given amount of time.
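
One way to get a feel for cost-optimality is to compare the cost C with the sequential time Ts as processors are added. The sketch below does this for a set of made-up timings; the numbers and the names cost and cost_ratio are assumptions for illustration only.

```python
def cost(tp: float, p: int) -> float:
    """Cost C = p * Tp, expressed in processor-time units."""
    return p * tp


# Hypothetical timings in seconds: number of processors -> parallel run time.
ts = 100.0
timings = {1: 100.0, 2: 52.0, 4: 30.0, 8: 20.0}

# Ratio of cost to sequential time for each processor count.
cost_ratio = {p: cost(tp, p) / ts for p, tp in timings.items()}

for p, ratio in cost_ratio.items():
    print(f"p = {p}: C = {cost(timings[p], p):6.1f}  C / Ts = {ratio:.2f}")
```

A ratio that stays close to 1 means the processors are mostly doing useful work. A ratio that keeps growing with p, as in these invented timings, means the extra processors are paid for largely in overhead.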

Multiprocessor Systems: Shared-memory systems

Shared memory in computer architecture is a memory model where multiple processors or cores access a common, central, global physical memory space, facilitating fast inter-process communication and data sharing.
It enables efficient parallel processing, though it requires cache coherence mechanisms (e.g., protocols to manage shared data copies in local caches).

Shared-memory systems

In shared-memory multiprocessors, multiple processors access one or more shared memory modules.
The processors may be physically connected to the memory modules in many ways, but logically every processor is connected to every memory module.
One of the major characteristics of shared memory multiprocessors is that all processors have equally direct access to one large memory address space.
The limitation of shared memory multiprocessors is memory access latency.

Shared Memory Types

Uniform Memory Access (UMA): All processors have equal access time to all memory locations (Symmetric Multiprocessor or SMP).

Non-Uniform Memory Access (NUMA): Memory access time depends on the memory location relative to the processor, typically with local memory for each processor and a shared global memory.

Cache Only Memory Architecture (COMA): A specialized type where all shared memory is used only as cache.

Interconnection Networks

Bus-based: Processors are connected to a shared bus, simple but limited by bus traffic.

Crossbar Switch: Provides direct paths between processors and memory modules, offering high performance but increased complexity.

Multi-stage Networks: Use intermediate switches to provide a middle ground between the bus and the crossbar in scalability and speed.

Key Concepts in Shared Memory

Core Architecture: All CPUs have access to a single block of RAM, allowing them to read/write data in a shared space, which is critical in multiprocessor systems.
Inter-Process Communication (IPC): Shared memory is used to allow multiple programs or threads to access the same memory, reducing the overhead of copying data between processes.
Cache Coherence: To maintain data consistency across CPUs with private caches, these systems must ensure that all processors see the same data in shared memory.
Performance Bottleneck: As the number of processors increases, contention for the shared memory can create bottlenecks, limiting scalability compared to distributed memory.

Advantages of the Shared Memory Model

Fast Communication: All processes can directly access the shared memory, so communication is fast.

Efficient for Large Data Transfers: Large data structures can be passed from one process to another at once, with no need to copy them into each process's memory space.

Less Kernel Involvement: Once the shared memory region has been established, the kernel no longer has to transfer data from one process to another, which would otherwise be costly.
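
The idea of sharing a region instead of copying data can be sketched with Python's standard multiprocessing.shared_memory module (Python 3.8 or later). To keep the example short, both the "producer" and the "consumer" handles live in one process here; that is a simplification, since in practice the second handle would be opened in a different process using the block's name.

```python
from multiprocessing import shared_memory

# "Producer" side: create a named shared block and write into it.
block = shared_memory.SharedMemory(create=True, size=16)
block.buf[:5] = b"hello"

# "Consumer" side: attach to the same block by name.
# Simplification: a real consumer would do this in another process.
view = shared_memory.SharedMemory(name=block.name)
received = bytes(view.buf[:5])  # reads the producer's bytes; no message is sent
print(received)

# Release both handles, then remove the block from the system.
view.close()
block.close()
block.unlink()
```

After the block is created, the data is written once and read in place, which is the property the advantages above rely on.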

Disadvantages of the Shared Memory Model

Complex Synchronization: Processes must use synchronization tools such as semaphores or mutexes to prevent race conditions.
Security Risks: Shared memory is exposed to security threats, because any process that can reach the shared region can access the data in it.
Limited to a Single Machine: Shared memory can only be used by processes on the same machine, so it does not extend across the nodes of a distributed system.
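
The synchronization burden can be illustrated with threads, which share one address space. In this minimal sketch, four threads increment a shared counter and a mutex (threading.Lock) protects each update; the names counter and add_many and the iteration counts are chosen for illustration.

```python
import threading

counter = 0              # shared data, visible to every thread
lock = threading.Lock()  # mutex guarding the counter


def add_many(times: int) -> None:
    """Increment the shared counter, holding the lock for each update."""
    global counter
    for _ in range(times):
        with lock:       # only one thread at a time may run this block
            counter += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 4 threads * 10_000 increments each
```

Without the lock, the read-modify-write in counter += 1 from different threads could interleave and lose updates, which is the race condition the mutex prevents.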

Multiprocessor Systems: Message-passing systems

Message-passing multiprocessors are computer architectures where multiple independent processors (nodes) possess their own private local memory, communicating explicitly via an interconnection network rather than a shared global memory.

Key Characteristics and Components

Node Architecture: Each node consists of a processor, local memory, and a network interface.
Explicit Communication: Data is exchanged using explicit SEND and RECEIVE commands inserted into the application software.
Interconnection Networks: Nodes are connected in various topologies (e.g., mesh, hypercube) to pass messages.
Message Ordering: Protocols ensure messages arrive in a defined sequence (FIFO, Causal, or Total ordering).
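
The explicit SEND and RECEIVE style can be imitated in a small Python sketch. This is a simulation under stated assumptions: a thread stands in for a node, and queue.Queue objects stand in for the network links, so it shows the programming model rather than real distributed hardware. The names worker, to_worker, and to_main are illustrative.

```python
import queue
import threading


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """A node with private state; it only sees data that arrives as messages."""
    local_total = 0                # stands in for the node's private local memory
    while True:
        msg = inbox.get()          # RECEIVE: blocks until a message arrives
        if msg is None:            # sentinel meaning "no more work"
            break
        local_total += msg
    outbox.put(local_total)        # SEND the result back


to_worker = queue.Queue()          # link from main node to worker (FIFO)
to_main = queue.Queue()            # link from worker back to main node
node = threading.Thread(target=worker, args=(to_worker, to_main))
node.start()

for value in (1, 2, 3, 4):
    to_worker.put(value)           # SEND
to_worker.put(None)

result = to_main.get()             # RECEIVE
node.join()
print(result)
```

Because queue.Queue delivers items in the order they were put, the links in this sketch behave like FIFO-ordered channels. The worker never touches the sender's variables; all data it uses arrives in messages.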

Advantages

Scalability: Systems can scale to a very large number of processors with little performance degradation, since each node works mainly from its own local memory.
Reduced Contention: Lack of shared memory avoids bottlenecks on memory access.
Simple Hardware: Local memory design is conceptually simpler than maintaining cache coherence in large shared-memory systems.
