1 of 17

The Processor�Introduction to Pipelining

Chapter 4

Sections: 4.6

Prof. Iyad Jafar

https://sites.google.com/view/iyadjafar

iyad.jafar@ju.edu.jo

2 of 17

Outline

Idea of Pipelining
Pipelining Performance
Pipelining Datapath

2

Prof. Iyad Jafar

3 of 17

Idea of Pipelining

3

Prof. Iyad Jafar

4 of 17

Introduction

Single-cycle datapath

Simple!
Hardware replication?
Cycle time?

Multi-cycle datapath

More involved
Less HW replication of major units
Better performance if the delay of major functional units is balanced!

Can we do any better?

Pipelining!

4

Prof. Iyad Jafar

5 of 17

Idea of Pipelining

In Multi-cycle, only one major unit is used in each cycle while other units might be idle!

Why not to use them to do something else?

Basically, start the next instruction before the current one is finished!

5

Cycle 1

Cycle 2

Cycle 3

Cycle 4

Cycle 5

IFetch

Dec

Exec

Mem

WB

LW

Cycle 7

Cycle 6

Cycle 8

SW

IFetch

Dec

Exec

Mem

WB

R-Type

IFetch

Dec

Exec

Mem

WB

IFetch

Dec

Exec

Mem

WB

IFetch

Dec

Exec

Mem

WB

Prof. Iyad Jafar

6 of 17

Idea of Pipelining

The time required to execute one instruction (Instruction latency) is not affected.
However, the number of instructions finished per unit time (Throughput) is increased.
Thus, Pipelining improves the throughput not latency.

Most modern processors are pipelined!

However, in pipelining:

Similar to multi-cycle: the cycle time is determined by the slowest unit.
Similar to single-cycle: ideally, we can get one instruction done every cycle (CPI=1).
It is assumed that all instructions take the same number of cycles.

6

Prof. Iyad Jafar

7 of 17

Comparison

7

Multiple Cycle Implementation:

Clk

Cycle 1

IFetch

Dec

Exec

Mem

WB

Cycle 2

Cycle 3

Cycle 4

Cycle 5

Cycle 6

Cycle 7

Cycle 8

Cycle 9

Cycle 10

IFetch

Dec

Exec

Mem

lw

sw

IFetch

R-type

lw

IFetch

Dec

Exec

Mem

WB

Pipeline Implementation:

IFetch

Dec

Exec

Mem

WB

sw

IFetch

Dec

Exec

Mem

WB

R-type

Clk

Single Cycle Implementation:

lw

sw

Waste

Cycle 1

Cycle 2

R-type

Prof. Iyad Jafar

8 of 17

Pipelining Performance

8

Prof. Iyad Jafar

9 of 17

Why Pipelining?

9

I

n

s

t

r.

O

r

d

e

r

Time (clock cycles)

Inst 1

Inst 2

Inst 3

Inst 5

Inst 4

ALU

IM

Reg

DM

Reg

ALU

IM

Reg

DM

Reg

ALU

IM

Reg

DM

Reg

ALU

IM

Reg

DM

Reg

ALU

IM

Reg

DM

Reg

Once the pipeline is full, one instruction is completed every cycle, so CPI = 1 (similar to Single-cycle)

Time to fill the pipeline

Prof. Iyad Jafar

10 of 17

Example 1

Consider a program that consists of a large number of LOAD instructions only that is executed on a single-cycle CPU and 5-stage pipelined CPU with the operation time for the major units (memory, ALU, and register file) to be 200 ps in both cases.

Determine the time required to finish executing 1,000,000 LOAD instructions and compute the speed up of pipelining.
Determine the time required to finish executing the first 3 LOAD instructions.
Repeat (1) and (2) if the delay of the register file is 100 ps instead of 200 ps.

10

Prof. Iyad Jafar

11 of 17

Example 1

Determine the time required to finish executing 1,000,000 LOAD instructions and compute the speed up of pipelining

11

Single-cycle

Time_SC = 1000 ps x 1000000 = 1,000,000,000 ps

Pipelining

Time_PP = 1000 ps + 200 ps x 999999 = 200,000,800 ps

Speeup = 1,000,000,000 / 200,000,800 = 4.99998

(very close to the number of stages)

Prof. Iyad Jafar

12 of 17

Example 1

Determine the time required to finish executing the first 3 LOAD instructions and compute the speed up of pipelining

12

Pipelining

Single-cycle

Time_SC = 1000 x 3 = 3000 ps

Time_PP = 200 x 5 +200 + 200 = 1400 ps

Speeup = 3000 / 1400 = 2.14

(less than the number of stages)

Prof. Iyad Jafar

13 of 17

Example 1

Repeat (1) and (2) if the delay of the register file is 100 ps

CC_SC = 200 + 100 + 200 + 200 + 100 = 800 ps

CC_PP = 200 ps

13

For 1,000,000 instructions

Time_SC = 800 x 1,000,000 = 800,000,000 ps

Time_PP = 1000+ 200x999,999 = 200,000,800ps

Speeup = 800,000,000/ 200,000,600 = 3.99998 (<5)

For 3 instructions

Time_SC = 800 x 3 = 2400 ps

Time_PP = 1000 + 200x 2 = 1400 ps

Speeup = 2400/ 1400 = 1.71 (<5)

Prof. Iyad Jafar

14 of 17

Example 1 Summary

14

Prof. Iyad Jafar

15 of 17

Pipelined Datapath

15

Prof. Iyad Jafar

16 of 17

What do we need to implement pipelining?

The execution of instructions is divided into 5 stages (cycles): Instruction fetch (IF) , Instruction decode (ID), Execute (EX), Memory Access (MEM), Write Back (WB)
Instruction flow is from left to right except in two cases:

In the write-back stage where the result is written into the register file in the middle of the datapath.
Choosing between the incremented PC and the branch address in the MEM stage.

In pipelining, all units are operating in every cycle; thus we have to duplicate hardware where needed.
Since the execution is over multiple cycles, we need to add State (Pipeline) registers between stages to preserve intermediate data and control for each instruction.

These registers hold the values to be used in later stages as long as they are needed.

16

Prof. Iyad Jafar

17 of 17

Pipelined Datapath and Control

17

Prof. Iyad Jafar