Computer Architecture
Chapter 8: Pipeline
Outline
● Introduction
● Hazards
What is Pipelining?
● A way of speeding up the execution of instructions
● Computer architecture relies on pipelining to process multiple instructions faster
● Key idea: "Overlap the execution of multiple instructions"
The Laundry Analogy
● If we do laundry sequentially, each load must finish washing, drying, and folding before the next load starts
● To pipeline, overlap the tasks: start washing the next load while the previous one dries
What is Pipelining? => Overlap Tasks
Pipelining a Digital System
● Key idea: break a big computation up into pieces
● Separate each piece with a pipeline register
Non-pipelined: 1 operation finishes every 1 ns
Pipelined: 1 operation finishes every 200 ps
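The throughput figures above can be checked with a minimal sketch. This assumes an idealized split of a 1 ns computation into five 200 ps stages and ignores pipeline-register overhead; it is a model, not a circuit:

```python
# Sketch: throughput vs. latency for the numbers on the slide.
# Assumes a 1 ns computation split into 5 ideal stages of 200 ps each
# (pipeline-register overhead ignored for simplicity).

def finish_times(n_ops, n_stages, stage_ps):
    """Completion time (ps) of each operation flowing through the pipeline."""
    return [(n_stages + i) * stage_ps for i in range(n_ops)]

non_pipelined = [(i + 1) * 1000 for i in range(5)]   # one op per 1000 ps
pipelined = finish_times(5, 5, 200)                  # one op per 200 ps once full

print(non_pipelined)  # [1000, 2000, 3000, 4000, 5000]
print(pipelined)      # [1000, 1200, 1400, 1600, 1800]
```

Note that each individual operation still takes 1000 ps (latency is unchanged); what improves is that a new result completes every 200 ps once the pipeline is full.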
Limitations of Pipelining
● Pipelining increases throughput, but does not reduce latency
● Computations must be divisible into similar-sized stages
● Pipeline registers add overhead
Processing Steps in the Pipeline
● Recall the 5 steps in instruction execution:
1) Instruction fetch (IF) – fetch the next instruction from main memory or the cache
2) Instruction decode and register read (ID) – decode the instruction and read the register file so the ALU knows what to compute next
3) Execute operation or calculate address (EX) – the ALU computes according to the opcode and operands
4) Memory access (MEM) – read or write data in memory
5) Write result into register (WB) – write the result of the processing back to the register file
● In a single-cycle processor, all five steps are executed in a single clock cycle, with dedicated hardware performing each step
Hardware for Single-Cycle Processor
Basic Pipeline for MIPS
Basic Pipelined Processor
Stage 1: Instruction Fetch (IF)
● Fetch an instruction from memory every cycle
● Write state to the pipeline register (IF/ID)
– The next stage will read this pipeline register
Stage 2: Instruction Decode (ID)
● Decode the opcode bits
– Set up control signals for later stages
● Read input operands from the register file
– Specified by the decoded instruction bits
● Write state to the pipeline register (ID/EX)
Stage 3: Execution (EX)
● Perform ALU operations
● Write state to the pipeline register (EX/MEM)
Stage 4: Memory (MEM)
● Perform data cache access
● Write state to the pipeline register (MEM/WB)
Stage 5: Write-back (WB)
● Write the result to the register file (if required)
Single-Cycle vs Pipelining
From the instruction-processing figure, if the CPU processes 3 instructions, find the speedup of the pipelined operation over the non-pipelined operation. Note: the pipelined operation has a pipeline-register delay of 0.3 ns per instruction.
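The exercise can be sketched numerically. The figure's actual stage times are not reproduced in this text, so `T_STAGE` below is an assumed value, and the 0.3 ns register delay is treated as per-stage overhead; adjust both to match the figure:

```python
# Hedged sketch of the speedup calculation. T_STAGE is an assumption
# (the figure's stage times are not available here); the 0.3 ns
# register delay is modeled as overhead added to every pipeline stage.

N = 3            # instructions processed
STAGES = 5       # IF, ID, EX, MEM, WB
T_STAGE = 1.0    # ns per stage (assumed value)
T_REG = 0.3      # ns pipeline-register delay (from the slide)

t_nonpipelined = N * STAGES * T_STAGE            # each instruction runs alone
t_pipelined = (STAGES + N - 1) * (T_STAGE + T_REG)  # fill time + 1 cycle per extra instr

speedup = t_nonpipelined / t_pipelined
print(f"{t_nonpipelined=} ns, {t_pipelined=} ns, speedup={speedup:.2f}")
```

With these assumed numbers: 15 ns non-pipelined versus 7 × 1.3 = 9.1 ns pipelined, a speedup of about 1.65.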
Pipeline Hazards
● Instructions interfere with each other => hazards
– Example 1: different instructions may need the same piece of hardware (e.g., memory) in the same clock cycle
– Example 2: an instruction may require a result produced by an earlier instruction that is not yet complete
Pipeline Hazards
● A hazard arises where one instruction cannot immediately follow another
● Hazards can always be resolved by waiting (stalling)
● Types of hazards: structural, data, and control
Structural Hazards
● An attempt by two or more instructions to use the same resource at the same time
● Example: a single memory for both instructions and data
● Solution => real pipelined processors have separate instruction and data caches
Pipelined Example
● Executing multiple instructions
● Consider the following instruction sequence:
lw  $r0, 10($r1)
sw  $r3, 20($r4)
add $r5, $r6, $r7
sub $r8, $r9, $r10
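The cycle-by-cycle walkthrough on the following slides can be reproduced with a small sketch that prints the pipeline occupancy for these four instructions, assuming an ideal 5-stage pipeline with no hazards:

```python
# Sketch: pipeline occupancy diagram for the four-instruction sequence,
# assuming an ideal 5-stage pipeline (no stalls, one issue per cycle).

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def diagram(instrs):
    """Each row: instruction i enters IF in cycle i+1, then advances one stage per cycle."""
    rows = []
    for i, ins in enumerate(instrs):
        row = ["  "] * i + STAGES[:]   # i leading idle cycles, then the 5 stages
        rows.append((ins, row))
    return rows

instrs = ["lw  $r0, 10($r1)", "sw  $r3, 20($r4)",
          "add $r5, $r6, $r7", "sub $r8, $r9, $r10"]
for name, row in diagram(instrs):
    print(f"{name:20s} " + " ".join(f"{s:>3s}" for s in row))
```

The first instruction completes in cycle 5 and the fourth in cycle 8, which is why the diagrams that follow cover clock cycles 1 through 8.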
Executing Multiple Instructions
● Clock Cycle 1
● Clock Cycle 2
● Clock Cycle 3
● Clock Cycle 4
● Clock Cycle 5
● Clock Cycle 6
● Clock Cycle 7
● Clock Cycle 8
Multicycle Diagram
Stall on Structural Hazards
Solutions for Structural Hazards
● Stall
● Pipeline the hardware resource (separate resources)
● Replicate the resource
Data Hazards
● Data hazards occur when data is used before it is ready
● Example: the use of the result of the SUB instruction in the next three instructions causes a data hazard, since register $2 is not written until after those instructions read it
● Read After Write (RAW)
● Write After Read (WAR)
● Write After Write (WAW)
● Read After Write (RAW)
● Write After Read (WAR)
● Write After Write (WAW)
– WAR and WAW can't happen in the MIPS 5-stage pipeline, because all instructions take the same 5 stages in order
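A RAW hazard can be found mechanically by scanning for a register that is read within a few instructions of being written. The sketch below encodes the SUB example described above; the exact follow-on instructions (and, or, add) are an assumed reconstruction consistent with the slide's statement that the next three instructions read $2:

```python
# Sketch: detect RAW hazards in a short instruction list. Each entry is
# (dest_register, [source_registers]). A register written fewer than
# `window` instructions earlier has not yet reached write-back, so a
# read of it within that window is a RAW hazard unless the pipeline
# forwards the value or stalls.

def raw_hazards(instrs, window=3):
    hazards = []
    for i, (dest, _) in enumerate(instrs):
        for j in range(i + 1, min(i + 1 + window, len(instrs))):
            if dest in instrs[j][1]:
                hazards.append((i, j, dest))
    return hazards

# Assumed reconstruction of the slide's SUB example ($2 read 3 times):
prog = [("$2",  ["$1", "$3"]),   # sub $2,  $1, $3
        ("$12", ["$2", "$5"]),   # and $12, $2, $5
        ("$13", ["$6", "$2"]),   # or  $13, $6, $2
        ("$14", ["$2", "$2"])]   # add $14, $2, $2
print(raw_hazards(prog))  # [(0, 1, '$2'), (0, 2, '$2'), (0, 3, '$2')]
```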
Solutions for Data Hazards
● Stalling
● Forwarding
– connect the new value directly to the next stage
● Reordering
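The forwarding decision can be sketched in the style of the classic MIPS forwarding unit. The function and signal names below are illustrative (they do not come from these slides), and only the first ALU operand (rs) is modeled; a real unit repeats the same check for rt:

```python
# Sketch of an EX-stage forwarding decision, modeled on the classic
# MIPS forwarding unit. Signal names are illustrative assumptions.

def forward_select(ex_mem_rd, mem_wb_rd, id_ex_rs,
                   regwrite_ex_mem, regwrite_mem_wb):
    """Return where the EX stage should take its rs operand from."""
    if regwrite_ex_mem and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return "EX/MEM"   # newest in-flight value: forward the ALU result
    if regwrite_mem_wb and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return "MEM/WB"   # older in-flight value from the MEM/WB register
    return "REGFILE"      # value was already written back normally

print(forward_select(2, 0, 2, True, False))   # EX/MEM
print(forward_select(0, 5, 5, False, True))   # MEM/WB
print(forward_select(0, 0, 4, False, False))  # REGFILE
```

The EX/MEM check takes priority because it holds the most recently produced value; register 0 is excluded since it is hardwired to zero in MIPS.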
Stall on Data Hazards
Data Hazards - Forwarding
● Key idea: connect the new value directly to the next stage
● Still read s0, but ignore it in favor of the new result
Data Hazards – Stall & Forwarding
● A stall is still required for loads, since the data is available only after MEM
● The MIPS architecture calls this a delayed load
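The load-use check itself is simple: if the instruction currently in EX is a load whose destination matches either source of the instruction in ID, one bubble is inserted. The signal names below are illustrative assumptions, not taken from these slides:

```python
# Sketch: load-use hazard detection. A lw result is available only
# after MEM, so an immediately following dependent instruction must
# stall one cycle even with forwarding. Signal names are illustrative.

def needs_load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """True if the instruction in ID depends on a load still in EX."""
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

# lw $r0, 10($r1) in EX, add $r5, $r0, $r7 in ID -> must stall
print(needs_load_use_stall(True, 0, 0, 7))   # True
# independent instruction next -> no stall needed
print(needs_load_use_stall(True, 0, 6, 7))   # False
```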
Control Hazards
● A control hazard occurs when we need to find the destination of a branch and cannot fetch any new instructions until we know that destination
● A branch is either taken or not taken
● When a branch occurs, the three instructions already fetched behind it must be flushed
Solutions for Control Hazards
● Stall
- stop fetching instructions until the branch result is available
● Predict
- guess the branch outcome and keep fetching; flush if the guess is wrong
● Delayed branch
- specify in the architecture that the instruction immediately following a branch is always executed
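The cost of each approach can be compared with a simple CPI model. The 3-cycle penalty matches the "flush three instructions" note above; the 20% branch frequency and 90% predictor accuracy are illustrative assumptions, not figures from these slides:

```python
# Sketch: CPI impact of branches. Assumes a 3-cycle flush penalty
# (matching the "flush three instructions" note); the branch frequency
# and predictor accuracy below are illustrative assumptions.

def branch_cpi(base_cpi, branch_freq, mispredict_rate, penalty):
    """Average CPI = base + (penalty paid on mishandled branches)."""
    return base_cpi + branch_freq * mispredict_rate * penalty

stall_always = branch_cpi(1.0, 0.20, 1.0, 3)    # stall on every branch
predicted    = branch_cpi(1.0, 0.20, 0.10, 3)   # 90%-accurate predictor
print(round(stall_always, 2), round(predicted, 2))  # 1.6 1.06
```

Under these assumptions, prediction cuts the branch overhead from 0.6 extra cycles per instruction to 0.06, which is why real pipelines predict rather than always stall.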
Control Hazards - Stall
Control Hazard - Prediction
● If the prediction is correct => no cycles are lost
● If the prediction is wrong => lose cycles (the wrongly fetched instructions are flushed)
Control Hazard - Delayed Branches
● Delayed branches – the compiler rearranges code to place an independent instruction after every branch (in the delay slot)