Chapter 5: Pipeline Processing

Parallel Processing:

  • Pipelining
  • Arithmetic Pipeline
  • Instruction Pipeline
  • RISC Pipeline

  • Parallel processing is a term used to denote a large class of techniques that are used to provide simultaneous data-processing tasks for the purpose of increasing the computational speed of a computer system.

  • Instead of processing each instruction sequentially as in a conventional computer, a parallel processing system is able to perform concurrent data processing to achieve faster execution time.

  • For example, while an instruction is being executed in the ALU, the next instruction can be read from memory. The system may have two or more ALUs and be able to execute two or more instructions at the same time.

  • The purpose of parallel processing is to speed up the computer processing capability and increase its throughput, that is the amount of processing that can be accomplished during a given interval of time.

  • The amount of hardware increases with parallel processing, and with it the cost of the system increases.
  • Parallel processing can be viewed from various levels of complexity.

  • At the lowest level, we distinguish between parallel and serial operations by the type of registers used.
  • Shift registers operate in serial fashion one bit at a time, while registers with parallel load operate with all the bits of the word simultaneously.

  • Parallel processing at a higher level of complexity can be achieved by having a multiplicity of functional units that perform identical or different operations simultaneously.

  • Parallel processing is established by distributing the data among the multiple functional units.
  • For example, the arithmetic, logic, and shift operations can be separated into three units and the operands diverted to each unit under the supervision of a control unit.

  • There are a variety of ways that parallel processing can be classified.
  • It can be considered from the internal organization of the processors, from the interconnection structure between processors, or from the flow of information through the system.

  • One classification introduced by M. J. Flynn considers the organization of a computer system by the number of instructions and data items that are manipulated simultaneously.

  • Flynn's classification divides computers into four major groups, as follows:

  • Single instruction stream, single data stream (SISD)
  • Single instruction stream, multiple data stream (SIMD)
  • Multiple instruction stream, single data stream (MISD)
  • Multiple instruction stream, multiple data stream (MIMD)

  • SISD represents the organization of a single computer containing a control unit, a processor unit, and a memory unit. Instructions are executed sequentially and the system may or may not have internal parallel processing capabilities. Parallel processing in this case may be achieved by means of multiple functional units or by pipeline processing.

  • SIMD represents an organization that includes many processing units under the supervision of a common control unit. All processors receive the same instruction from the control unit but operate on different items of data. The shared memory unit must contain multiple modules so that it can communicate with all the processors simultaneously.

  • MISD structure is only of theoretical interest since no practical system has been constructed using this organization.

  • MIMD organization refers to a computer system capable of processing several programs at the same time. Most multiprocessor and multicomputer systems can be classified in this category.

  • In this chapter we consider parallel processing under the following main topics:
  • 1. Pipeline processing
  • 2. Vector processing
  • 3. Array processors

  • Pipeline processing is an implementation technique where arithmetic sub-operations or the phases of a computer instruction cycle overlap in execution.
  • Vector processing deals with computations involving large vectors and matrices.
  • Array processors perform computations on large arrays of data.

Pipelining

  • Pipelining is a technique of decomposing a sequential process into sub-operations, with each sub-process being executed in a special dedicated segment that operates concurrently with all other segments.
  • A pipeline can be visualized as a collection of processing segments through which binary information flows.
  • Each segment performs partial processing dictated by the way the task is partitioned.
  • The result obtained from the computation in each segment is transferred to the next segment in the pipeline.

Pipelining (continued)

  • The final result is obtained after the data have passed through all segments. The name "pipeline" implies a flow of information analogous to an industrial assembly line. It is characteristic of pipelines that several computations can be in progress in distinct segments at the same time.
  • The overlapping of computation is made possible by associating a register with each segment in the pipeline.
  • The registers provide isolation between each segment so that each can operate on distinct data simultaneously.

Pipeline (Example)

  • Suppose that we want to perform the combined multiply and add operations with a stream of numbers.
  • Ai * Bi + Ci   for i = 1, 2, 3, ..., 7
  • Each sub operation is to be implemented in a segment within a pipeline.
  • Each segment has one or two registers and a combinational circuit as shown in Fig.

  • The suboperations performed in each segment of the pipeline are as follows:
  • Segment 1: R1 <- Ai, R2 <- Bi (input Ai and Bi)
  • Segment 2: R3 <- R1 * R2, R4 <- Ci (multiply and transfer Ci)
  • Segment 3: R5 <- R3 + R4 (add Ci to the product)

Example of pipelining

Table Operation

  • The first clock pulse transfers A1 and B1 into R1 and R2. The second clock pulse transfers the product of R1 and R2 into R3 and C1 into R4. The same clock pulse transfers A2 and B2 into R1 and R2. The third clock pulse operates on all three segments simultaneously: it places A3 and B3 into R1 and R2, transfers the product of R1 and R2 into R3, transfers C2 into R4, and places the sum of R3 and R4 into R5. It takes three clock pulses to fill up the pipe and retrieve the first output from R5. From then on, each clock pulse produces a new output and moves the data one step down the pipeline.
  • This happens as long as new input data flow into the system. When no more input data are available, the clock must continue until the last output emerges out of the pipeline.
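The clock-by-clock behavior described above can be sketched in a few lines of Python (an illustration, not part of the original text). Each segment's registers latch a new value on every simulated clock pulse, computed from the values latched on the pulse before:

```python
def pipeline_multiply_add(A, B, C):
    """Model Ai * Bi + Ci flowing through a three-segment pipeline."""
    n = len(A)
    seg1 = None   # segment 1 registers: R1 <- Ai, R2 <- Bi (Ci rides along)
    seg2 = None   # segment 2 registers: R3 <- R1 * R2, R4 <- Ci
    results = []
    for clock in range(n + 2):        # k + n - 1 pulses for k = 3 segments
        # All registers are clocked simultaneously: each new value is
        # computed from the contents latched on the *previous* pulse.
        out  = seg2[0] + seg2[1] if seg2 else None          # R5 <- R3 + R4
        new2 = (seg1[0] * seg1[1], seg1[2]) if seg1 else None
        new1 = (A[clock], B[clock], C[clock]) if clock < n else None
        seg1, seg2 = new1, new2
        if out is not None:
            results.append(out)
    return results

print(pipeline_multiply_add([1, 2, 3], [4, 5, 6], [7, 8, 9]))
# [11, 18, 27] -- the first result emerges on the third clock pulse
```

Note how the temporaries `out`, `new2`, and `new1` are all computed before any register is overwritten; this mirrors the edge-triggered latching that lets all three segments operate on distinct data at once.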

Four-segment pipeline

Four-segment pipeline (explanation)

  • The operands pass through all four segments in a fixed sequence. Each segment consists of a combinational circuit Si that performs a suboperation over the data stream flowing through the pipe.
  • The segments are separated by registers Ri that hold the intermediate results between the stages.

Space-time diagram

  • The horizontal axis displays the time in clock cycles and the vertical axis gives the segment number. The diagram shows six tasks T1 through T6 executed in four segments. Initially, task T1 is handled by segment 1. After the first clock, segment 2 is busy with T1, while segment 1 is busy with task T2.
  • Continuing in this manner, the first task T1 is completed after the fourth clock cycle. From then on, the pipe completes a task every clock cycle.
  • No matter how many segments there are in the system, once the pipeline is full, it takes only one clock period to obtain an output.
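The space-time diagram described above can be generated with a short illustrative Python function (not part of the original text); task Ti occupies segment s during clock cycle i + s - 1:

```python
def space_time(k, n):
    """Space-time diagram: k segments (rows), n tasks, clock cycles as columns."""
    rows = []
    for seg in range(1, k + 1):
        cells = []
        for clock in range(1, n + k):          # n + k - 1 clock cycles in all
            task = clock - seg + 1             # task occupying this segment now
            cells.append(f"T{task}" if 1 <= task <= n else "  ")
        rows.append(f"S{seg}: " + " ".join(cells))
    return "\n".join(rows)

print(space_time(4, 6))
```

For k = 4 and n = 6 this reproduces the diagram in the figure: T1 leaves segment 4 after the fourth clock cycle, and a task completes on every cycle thereafter.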

Space-time diagram for the pipeline

Speedup

  • The speedup of pipeline processing over an equivalent non-pipelined process is defined by the ratio S = (n * tn) / ((k + n - 1) * tp), where n is the number of tasks, k the number of pipeline segments, tp the clock period of the pipeline, and tn the time to complete one task without the pipeline.

  • To clarify the meaning of the speedup ratio, consider the following numerical example. Let the time it takes to process a suboperation in each segment be tp = 20 ns. Assume that the pipeline has k = 4 segments and executes n = 100 tasks in sequence. The pipeline system will take (k + n - 1)tp = (4 + 99) x 20 = 2060 ns to complete. Assuming that tn = k*tp = 4 x 20 = 80 ns, a non-pipelined system requires n*k*tp = 100 x 80 = 8000 ns to complete the 100 tasks. The speedup ratio is equal to 8000/2060 = 3.88.
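The numerical example works out as follows in Python (an illustration; the values are taken directly from the text):

```python
k, n, tp = 4, 100, 20                  # segments, tasks, ns per segment
pipeline_time = (k + n - 1) * tp       # (4 + 99) * 20 = 2060 ns
tn = k * tp                            # 80 ns per task without a pipeline
nonpipeline_time = n * tn              # 100 * 80 = 8000 ns
speedup = nonpipeline_time / pipeline_time
print(f"{speedup:.2f}")                # 3.88
```

As n grows large relative to k, the ratio approaches tn / tp = k, the theoretical maximum speedup of a k-segment pipeline.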

  • For comparison, consider an equivalent arrangement in which four identical circuits are connected in parallel. Each P circuit performs the same task as an equivalent pipeline circuit. Instead of operating with the input data in sequence as in a pipeline, the parallel circuits accept four input data items simultaneously and perform four tasks at the same time.

Arithmetic Pipeline

  • Pipeline arithmetic units are usually found in very high speed computers. They are used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems.

Example: Floating-point addition and subtraction

  • The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers, X = A x 2^a and Y = B x 2^b.

  • A and B are two fractions that represent the mantissas, and a and b are the exponents.
  • The floating-point addition and subtraction can be performed in four segments, as shown in Fig.

Operation

  • The suboperations that are performed in the four segments are:

1. Compare the exponents.

2. Align the mantissas.

3. Add or subtract the mantissas.

4. Normalize the result.
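These four suboperations can be sketched with a simplified decimal example in Python (an illustration only; real hardware works on binary fractions and runs the four steps in separate pipeline segments):

```python
def fp_add(A, a, B, b):
    """Add A x 10^a and B x 10^b using the four suboperations above."""
    # Segments 1-2: compare the exponents and align the mantissa of the
    # number with the smaller exponent by shifting it right.
    if a >= b:
        B, exp = B / (10 ** (a - b)), a
    else:
        A, exp = A / (10 ** (b - a)), b
    M = A + B                      # Segment 3: add the mantissas
    while abs(M) >= 1:             # Segment 4: normalize so 0.1 <= |M| < 1
        M, exp = M / 10, exp + 1
    while M != 0 and abs(M) < 0.1:
        M, exp = M * 10, exp - 1
    return M, exp

print(fp_add(0.9504, 3, 0.8200, 2))    # roughly (0.10324, 4)
```

Here 0.9504 x 10^3 + 0.8200 x 10^2 aligns to 0.9504 + 0.0820 = 1.0324, which overflows the mantissa range and is normalized to 0.10324 x 10^4.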

Instruction Pipeline

  • Pipeline processing can occur not only in the data stream but in the instruction stream as well.
  • An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments.
  • This causes the instruction fetch and execute phases to overlap and perform simultaneous operations.
  • One possible complication associated with such a scheme is that an instruction may cause a branch out of sequence.
  • In that case the pipeline must be emptied and all the instructions that have been read from memory after the branch instruction must be discarded.
  • The instruction fetch segment can be implemented by means of a first-in, first-out (FIFO) buffer. This is a type of unit that forms a queue rather than a stack.

  • In the most general case, the computer needs to process each instruction with the following sequence of steps:
  • 1. Fetch the instruction from memory.
  • 2. Decode the instruction.
  • 3. Calculate the effective address.
  • 4. Fetch the operands from memory.
  • 5. Execute the instruction.
  • 6. Store the result in the proper place.

Four-segment CPU pipeline

  • The figure shows how the instruction cycle in the CPU can be processed with a four-segment pipeline. While an instruction is being executed in segment 4, the next instruction in sequence is busy fetching an operand from memory in segment 3.

  • The effective address may be calculated in a separate arithmetic circuit for the third instruction, and whenever the memory is available, the fourth and all subsequent instructions can be fetched and placed in an instruction FIFO.

  • Thus up to four suboperations in the instruction cycle can overlap, and up to four different instructions can be processed at the same time.

  • The four segments are represented in the diagram with an abbreviated symbol.

  • FI is the segment that fetches an instruction.
  • DA is the segment that decodes the instruction and calculates the effective address.
  • FO is the segment that fetches the operand.
  • EX is the segment that executes the instruction.
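The overlap of the four segments can be visualized with a short illustrative Python snippet (not part of the original text) that prints one row per instruction; instruction i enters segment s at clock cycle i + s - 1:

```python
SEGMENTS = ["FI", "DA", "FO", "EX"]

def timing(n_instructions):
    """One row per instruction; columns are clock cycles."""
    k = len(SEGMENTS)
    rows = []
    for i in range(1, n_instructions + 1):
        cells = ["  "] * (n_instructions + k - 1)
        for s, name in enumerate(SEGMENTS):
            cells[i - 1 + s] = name        # segment s occupied at clock i + s - 1
        rows.append(f"I{i}: " + " ".join(cells))
    return "\n".join(rows)

print(timing(4))
```

In steady state, each column contains all four segment names, i.e., four different instructions are in progress simultaneously.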

Timing of instruction pipeline

  • In general, there are three major difficulties that cause the instruction pipeline to deviate from its normal operation.
  • 1. Resource conflicts caused by access to memory by two segments at the same time. Most of these conflicts can be resolved by using separate instruction and data memories.
  • 2. Data dependency conflicts arise when an instruction depends on the result of a previous instruction, but this result is not yet available.
  • 3. Branch difficulties arise from branch and other instructions that change the value of the PC.

RISC Pipeline

  • Among the characteristics attributed to RISC is its ability to use an efficient instruction pipeline.

  • The simplicity of the instruction set can be utilized to implement an instruction pipeline using a small number of sub operations, with each being executed in one clock cycle.

  • A major advantage of RISC is its ability to execute instructions at the rate of one per clock cycle. It is not possible, of course, for every instruction to be fetched from memory and executed in one clock cycle.

  • What is done, in effect, is to start each instruction with each clock cycle and to pipeline the processor to achieve the goal of single-cycle instruction execution.

  • The advantage of RISC over CISC (complex instruction set computer) is that RISC can achieve pipeline segments requiring just one clock cycle, while CISC uses many segments in its pipeline, with the longest segment requiring two or more clock cycles.

Example: Three-Segment Instruction Pipeline

  • A typical set of instructions for a RISC processor is listed in Table 8-12.
  • The instruction cycle can be divided into three sub operations and implemented in three segments:

  • I: Instruction fetch
  • A: ALU operation
  • E: Execute instruction

Detail

  • The I segment fetches the instruction from program memory. The instruction is decoded and an ALU operation is performed in the A segment.
  • The ALU is used for three different functions, depending on the decoded instruction.
  • It performs an operation for a data manipulation instruction, it evaluates the effective address for a load or store instruction, or it calculates the branch address for a program control instruction.
  • The E segment directs the output of the ALU to one of three destinations, depending on the decoded instruction.

Delayed Load

  • Consider now the operation of the following four instructions:
  • 1. LOAD: R1 <- M[address 1]
  • 2. LOAD: R2 <- M[address 2]
  • 3. ADD: R3 <- R1 + R2
  • 4. STORE: M[address 3] <- R3
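A small sketch (illustrative Python, not from the text) shows why these instructions conflict in the three-segment I/A/E pipeline: the second LOAD writes R2 at the end of its E stage, but the ADD reads its operands during its A stage in that very same clock cycle, one cycle too early. Inserting a no-op after the LOAD, the delayed-load solution, removes the conflict:

```python
def conflict(prog):
    """Return True if an instruction uses a register loaded by the
    instruction immediately before it (a load-use conflict)."""
    for i, instr in enumerate(prog[:-1]):
        if instr.startswith("LOAD") and instr.split()[1] in prog[i + 1]:
            return True
    return False

prog = ["LOAD R1", "LOAD R2", "ADD R3 R1 R2", "STORE M3 R3"]
print(conflict(prog))                           # True: ADD needs R2 too soon
print(conflict(prog[:2] + ["NOP"] + prog[2:]))  # False once a NOP is inserted
```

An optimizing compiler can often fill the delay slot with a useful instruction instead of a NOP; in this model the first LOAD and the ADD do not conflict because one instruction separates them.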
