Computer Architecture
Chapter 8: Pipeline
Outline
● Introduction
● Hazards
What is Pipelining?
● A way of speeding up the execution of instructions
● Computer architecture relies on pipelining to process multiple instructions faster
● Key idea: "Overlap the execution of multiple instructions"
The Laundry Analogy
● If we do laundry sequentially, each load must finish washing, drying, and folding before the next load starts
● To pipeline, overlap the tasks: start washing the next load while the previous one dries
What is Pipelining? => Overlap Tasks
Pipelining a Digital System
● Key idea: break a big computation up into pieces
● Separate each piece with a pipeline register
Non-pipelined: 1 operation finishes every 1 ns
Pipelined: 1 operation finishes every 200 ps
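The throughput figures above can be checked with a minimal sketch. This assumes an idealized split of a 1 ns computation into five 200 ps stages and ignores pipeline-register overhead; it is a model, not a circuit:

```python
# Sketch: throughput vs. latency for the numbers on the slide.
# Assumes a 1 ns computation split into 5 ideal stages of 200 ps each
# (pipeline-register overhead ignored for simplicity).

def finish_times(n_ops, n_stages, stage_ps):
    """Completion time (ps) of each operation flowing through the pipeline."""
    return [(n_stages + i) * stage_ps for i in range(n_ops)]

non_pipelined = [(i + 1) * 1000 for i in range(5)]   # one op per 1000 ps
pipelined = finish_times(5, 5, 200)                  # one op per 200 ps once full

print(non_pipelined)  # [1000, 2000, 3000, 4000, 5000]
print(pipelined)      # [1000, 1200, 1400, 1600, 1800]
```

Note that each individual operation still takes 1000 ps (latency is unchanged); what improves is that a new result completes every 200 ps once the pipeline is full.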
Limitations of Pipelining
● Pipelining increases throughput, but does not reduce latency
● Computations must be divisible into similar-sized stages
● Pipeline registers add overhead
Processing Steps in the Pipeline
● Recall the 5 steps in instruction execution:
1) Instruction fetch (IF) – fetch the next instruction from main memory or the cache
2) Instruction decode and register read (ID) – decode the instruction and read the register file so the ALU knows what to compute next
3) Execute operation or calculate address (EX) – the ALU computes according to the opcode and operands
4) Memory access (MEM) – read or write data in memory
5) Write result into register (WB) – write the result of the processing back to the register file
● In a single-cycle processor, all five steps are executed in a single clock cycle, with dedicated hardware performing each step
Hardware for Single-Cycle Processor
Basic Pipeline for MIPS
Basic Pipelined Processor
Stage 1: Instruction Fetch (IF)
● Fetch an instruction from memory every cycle
● Write state to the pipeline register (IF/ID)
– The next stage will read this pipeline register
Stage 2: Instruction Decode (ID)
● Decode the opcode bits
– Set up control signals for later stages
● Read input operands from the register file
– Specified by the decoded instruction bits
● Write state to the pipeline register (ID/EX)
Stage 3: Execution (EX)
● Perform ALU operations
● Write state to the pipeline register (EX/MEM)
Stage 4: Memory (MEM)
● Perform data cache access
● Write state to the pipeline register (MEM/WB)
Stage 5: Write-back (WB)
● Write the result to the register file (if required)
Single-Cycle vs Pipelining
From the instruction-processing figure, if the CPU processes 3 instructions, find the speedup of the pipelined operation over the non-pipelined operation. Note: the pipelined operation has a pipeline-register delay of 0.3 ns per instruction.
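The exercise can be sketched numerically. The figure's actual stage times are not reproduced in this text, so `T_STAGE` below is an assumed value, and the 0.3 ns register delay is treated as per-stage overhead; adjust both to match the figure:

```python
# Hedged sketch of the speedup calculation. T_STAGE is an assumption
# (the figure's stage times are not available here); the 0.3 ns
# register delay is modeled as overhead added to every pipeline stage.

N = 3            # instructions processed
STAGES = 5       # IF, ID, EX, MEM, WB
T_STAGE = 1.0    # ns per stage (assumed value)
T_REG = 0.3      # ns pipeline-register delay (from the slide)

t_nonpipelined = N * STAGES * T_STAGE            # each instruction runs alone
t_pipelined = (STAGES + N - 1) * (T_STAGE + T_REG)  # fill time + 1 cycle per extra instr

speedup = t_nonpipelined / t_pipelined
print(f"{t_nonpipelined=} ns, {t_pipelined=} ns, speedup={speedup:.2f}")
```

With these assumed numbers: 15 ns non-pipelined versus 7 × 1.3 = 9.1 ns pipelined, a speedup of about 1.65.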
Pipeline Hazards
● Instructions interfere with each other => hazards
– Example 1: different instructions may need the same piece of hardware (e.g., memory) in the same clock cycle
– Example 2: an instruction may require a result produced by an earlier instruction that is not yet complete
Pipeline Hazards
● A hazard arises where one instruction cannot immediately follow another
● Hazards can always be resolved by waiting (stalling)
● Types of hazards: structural, data, and control
Structural Hazards
● An attempt by two or more instructions to use the same resource at the same time
● Example: a single memory for both instructions and data
● Solution => real pipelined processors have separate instruction and data caches
Pipelined Example
● Executing multiple instructions
● Consider the following instruction sequence:
lw  $r0, 10($r1)
sw  $r3, 20($r4)
add $r5, $r6, $r7
sub $r8, $r9, $r10
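The cycle-by-cycle walkthrough on the following slides can be reproduced with a small sketch that prints the pipeline occupancy for these four instructions, assuming an ideal 5-stage pipeline with no hazards:

```python
# Sketch: pipeline occupancy diagram for the four-instruction sequence,
# assuming an ideal 5-stage pipeline (no stalls, one issue per cycle).

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def diagram(instrs):
    """Each row: instruction i enters IF in cycle i+1, then advances one stage per cycle."""
    rows = []
    for i, ins in enumerate(instrs):
        row = ["  "] * i + STAGES[:]   # i leading idle cycles, then the 5 stages
        rows.append((ins, row))
    return rows

instrs = ["lw  $r0, 10($r1)", "sw  $r3, 20($r4)",
          "add $r5, $r6, $r7", "sub $r8, $r9, $r10"]
for name, row in diagram(instrs):
    print(f"{name:20s} " + " ".join(f"{s:>3s}" for s in row))
```

The first instruction completes in cycle 5 and the fourth in cycle 8, which is why the diagrams that follow cover clock cycles 1 through 8.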
Executing Multiple Instructions
● Clock Cycle 1
● Clock Cycle 2
● Clock Cycle 3
● Clock Cycle 4
● Clock Cycle 5
● Clock Cycle 6
● Clock Cycle 7
● Clock Cycle 8
Multicycle Diagram
Stall on Structural Hazards
Solutions for Structural Hazards
● Stall
● Pipeline the hardware resource (separate resources)
● Replicate the resource
Data Hazards
● Data hazards occur when data is used before it is ready
● Example: the use of the result of the SUB instruction in the next three instructions causes a data hazard, since register $2 is not written until after those instructions read it
● Read After Write (RAW)
● Write After Read (WAR)
● Write After Write (WAW)
● Read After Write (RAW)
● Write After Read (WAR)
● Write After Write (WAW)
– WAR and WAW can't happen in the MIPS 5-stage pipeline, because all instructions take the same 5 stages in order
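A RAW hazard can be found mechanically by scanning for a register that is read within a few instructions of being written. The sketch below encodes the SUB example described above; the exact follow-on instructions (and, or, add) are an assumed reconstruction consistent with the slide's statement that the next three instructions read $2:

```python
# Sketch: detect RAW hazards in a short instruction list. Each entry is
# (dest_register, [source_registers]). A register written fewer than
# `window` instructions earlier has not yet reached write-back, so a
# read of it within that window is a RAW hazard unless the pipeline
# forwards the value or stalls.

def raw_hazards(instrs, window=3):
    hazards = []
    for i, (dest, _) in enumerate(instrs):
        for j in range(i + 1, min(i + 1 + window, len(instrs))):
            if dest in instrs[j][1]:
                hazards.append((i, j, dest))
    return hazards

# Assumed reconstruction of the slide's SUB example ($2 read 3 times):
prog = [("$2",  ["$1", "$3"]),   # sub $2,  $1, $3
        ("$12", ["$2", "$5"]),   # and $12, $2, $5
        ("$13", ["$6", "$2"]),   # or  $13, $6, $2
        ("$14", ["$2", "$2"])]   # add $14, $2, $2
print(raw_hazards(prog))  # [(0, 1, '$2'), (0, 2, '$2'), (0, 3, '$2')]
```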
Solutions for Data Hazards
● Stalling
● Forwarding
– connect the new value directly to the next stage
● Reordering
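The forwarding decision can be sketched in the style of the classic MIPS forwarding unit. The function and signal names below are illustrative (they do not come from these slides), and only the first ALU operand (rs) is modeled; a real unit repeats the same check for rt:

```python
# Sketch of an EX-stage forwarding decision, modeled on the classic
# MIPS forwarding unit. Signal names are illustrative assumptions.

def forward_select(ex_mem_rd, mem_wb_rd, id_ex_rs,
                   regwrite_ex_mem, regwrite_mem_wb):
    """Return where the EX stage should take its rs operand from."""
    if regwrite_ex_mem and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return "EX/MEM"   # newest in-flight value: forward the ALU result
    if regwrite_mem_wb and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return "MEM/WB"   # older in-flight value from the MEM/WB register
    return "REGFILE"      # value was already written back normally

print(forward_select(2, 0, 2, True, False))   # EX/MEM
print(forward_select(0, 5, 5, False, True))   # MEM/WB
print(forward_select(0, 0, 4, False, False))  # REGFILE
```

The EX/MEM check takes priority because it holds the most recently produced value; register 0 is excluded since it is hardwired to zero in MIPS.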
Stall on Data Hazards
Data Hazards - Forwarding
● Key idea: connect the new value directly to the next stage
● Still read s0, but ignore it in favor of the new result
Data Hazards – Stall & Forwarding
● A stall is still required for loads, since the data is available only after MEM
● The MIPS architecture calls this a delayed load
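The load-use check itself is simple: if the instruction currently in EX is a load whose destination matches either source of the instruction in ID, one bubble is inserted. The signal names below are illustrative assumptions, not taken from these slides:

```python
# Sketch: load-use hazard detection. A lw result is available only
# after MEM, so an immediately following dependent instruction must
# stall one cycle even with forwarding. Signal names are illustrative.

def needs_load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """True if the instruction in ID depends on a load still in EX."""
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

# lw $r0, 10($r1) in EX, add $r5, $r0, $r7 in ID -> must stall
print(needs_load_use_stall(True, 0, 0, 7))   # True
# independent instruction next -> no stall needed
print(needs_load_use_stall(True, 0, 6, 7))   # False
```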
Control Hazards
● A control hazard occurs when we need to find the destination of a branch and cannot fetch any new instructions until we know that destination
● A branch is either taken or not taken
● When a branch occurs, the three instructions already fetched behind it must be flushed
Solutions for Control Hazards
● Stall
- stop fetching instructions until the branch result is available
● Predict
- guess the branch outcome and keep fetching; flush if the guess is wrong
● Delayed branch
- specify in the architecture that the instruction immediately following a branch is always executed
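The cost of each approach can be compared with a simple CPI model. The 3-cycle penalty matches the "flush three instructions" note above; the 20% branch frequency and 90% predictor accuracy are illustrative assumptions, not figures from these slides:

```python
# Sketch: CPI impact of branches. Assumes a 3-cycle flush penalty
# (matching the "flush three instructions" note); the branch frequency
# and predictor accuracy below are illustrative assumptions.

def branch_cpi(base_cpi, branch_freq, mispredict_rate, penalty):
    """Average CPI = base + (penalty paid on mishandled branches)."""
    return base_cpi + branch_freq * mispredict_rate * penalty

stall_always = branch_cpi(1.0, 0.20, 1.0, 3)    # stall on every branch
predicted    = branch_cpi(1.0, 0.20, 0.10, 3)   # 90%-accurate predictor
print(round(stall_always, 2), round(predicted, 2))  # 1.6 1.06
```

Under these assumptions, prediction cuts the branch overhead from 0.6 extra cycles per instruction to 0.06, which is why real pipelines predict rather than always stall.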
Control Hazards - Stall
Control Hazard - Prediction
● If the prediction is correct => no cycles are lost
● If the prediction is wrong => lose cycles (the wrongly fetched instructions are flushed)
Control Hazard - Delayed Branches
● Delayed branches – the compiler rearranges code to place an independent instruction after every branch (in the delay slot)