
CS 61C

Discussion 7: Pipelining

John Yang

Summer 2019


Announcements

  • Midterm 2 next Monday!
  • Check In: tinyurl.com/john61c
  • Homework 5 due Wednesday
  • Project 3 due July 31st

Today’s Goals

  • Learn how to improve the throughput of the single-cycle datapath design with the help of pipelining.
  • Understand the kinds of obstacles and challenges that arise in a pipelined design and how they can interfere with correct execution.


Pipelining

A nifty technique for increasing the throughput of the processor datapath


Pipelining Motivation

  • Currently, in a single-cycle datapath setting, we wait for a single instruction to run to completion (all 5 stages) before processing the next instruction.
    • Think utilization! Only 1 of the 5 stages is active at any time => 20% utilization isn’t great! (See the worked example after this list.)
  • What if, instead of waiting for an instruction to finish completely, we passed the next instruction into the datapath as soon as the previous instruction finishes the IF stage?
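
As a rough worked example of the payoff (assuming all five stages take the same time t and ignoring hazards), consider 1000 instructions:

Single cycle: 1000 instructions × 5t per instruction = 5000t
Pipelined:    the first instruction finishes after 5t, then one finishes every t => (1000 + 4) × t ≈ 1004t

That is roughly a 5x improvement in throughput, even though each individual instruction still takes 5t of latency.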


Pipelining

In a setting with a CPU processing RISC-V code, successive instructions overlap in the datapath: while one instruction is being decoded, the next is already being fetched, so every stage can be working on a different instruction in each cycle.


Pipelining

To make this work, we require more hardware, specifically more registers! These pipeline registers sit between consecutive stages and store the intermediate values produced by each stage (they are commonly called IF/ID, ID/EX, EX/MEM, and MEM/WB).

The five stages: Instruction Fetch (IF) → Instruction Decode (ID) → Execute (EX) → Memory Access (MEM) → Writeback (WB)


Hazards

The transition from single-cycle to pipelined CPU designs introduces problems with executing instructions correctly and in order


Structural Hazards

Two instructions in different stages attempt to use the same piece of hardware simultaneously

  • “Hardware” here usually refers to the Register File and Memory
  • e.g. “lw” in Writeback and “add” in Instruction Decode both use the Register File (a concrete sketch follows below)

Solutions

  • Stall by inserting NOPs (no-operation instructions), which slows the pipeline down
  • Add more hardware + architecture! (More hardware can always resolve a structural hazard)

Registers

  • Writes occur on Rising Edge of Clock
  • Reads occur on Falling Edge of Clock
  • A.k.a. “Double Pumping”

Memory

  • Split Memory into IMEM (instruction memory) and DMEM (data memory), exactly what the original pipelined datapath did!
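
As a concrete sketch of the Register File case (the extra instructions are made up for illustration): without double pumping, instruction 1 and instruction 4 both need the Register File in cycle 5.

lw  x9, 0(x5)       # instruction 1: writes x9 to the Register File in WB (cycle 5)
add x20, x21, x22   # instruction 2
add x23, x24, x25   # instruction 3
add x10, x6, x7     # instruction 4: reads its operands from the Register File in ID (cycle 5)

With double pumping, the write happens in the first half of cycle 5 and the read in the second half, so both instructions are served.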



Data Hazards

An instruction relies on a register whose new value has not yet been written back to the register file by the time the instruction reads it

Example

add x9, x8, x7

add x10, x9, x8

Inst 1 writes to x9 after Inst 2 reads it (Bad!)

Cycle     1    2    3    4    5    6    7    8    9
Inst 1    IF   ID   EX   MEM  WB
Inst 2         IF   ID   EX   MEM  WB

x9 value retrieved: Inst 2’s ID stage (cycle 3)
x9 modified: Inst 1’s WB stage (cycle 5)


Data Hazards

First Solution: More Stalling! Insert NOPs that delay the next instruction until the register has been written. The need for the delay is detected in the Decode stage. (A code sketch of the inserted NOPs follows the diagram below.)

Example

add x9, x8, x7

add x10, x9, x8

Delay causes it to be almost sequential!

Cycle     1    2    3    4    5    6    7    8    9
Inst 1    IF   ID   EX   MEM  WB
Inst 2         IF   ID   ID   ID   ID   EX   MEM  WB

x9 modified: Inst 1’s WB stage (cycle 5)
x9 value retrieved: Inst 2’s final ID stage (cycle 6), after the write has completed
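
In code, the stall behaves as if NOPs were inserted between the two instructions (a sketch; in the real datapath the hardware simply holds Inst 2 in the Decode stage rather than the assembler emitting literal nop instructions):

add x9, x8, x7
nop                 # bubble: wait for x9 to be written
nop                 # bubble
nop                 # bubble
add x10, x9, x8     # now reads the updated x9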


Data Hazards

Second Solution: First Solution + Double Pumping (recall Structural Hazards). A write and a read to the same register can now happen within one clock cycle, so one fewer stall is needed (sketched below).

Example

add x9, x8, x7

add x10, x9, x8

One fewer NOP isn’t much...

Cycle     1    2    3    4    5    6    7    8    9
Inst 1    IF   ID   EX   MEM  WB
Inst 2         IF   ID   ID   ID   EX   MEM  WB

x9 modified: Inst 1’s WB stage (first half of cycle 5)
x9 value retrieved: Inst 2’s final ID stage (second half of cycle 5)
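
Mirroring the earlier NOP sketch (again an illustration, not literal assembler output), double pumping removes one bubble because the write in the first half of cycle 5 and the read in the second half share the cycle:

add x9, x8, x7
nop                 # bubble
nop                 # bubble
add x10, x9, x8     # reads x9 in the same cycle Inst 1 writes it back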


Data Hazards

Third Solution (The Best!): Forward the result from the Execute stage to the next instruction, specifically right before the result is actually used.

Example

add x9, x8, x7

add x10, x9, x8

Cycle     1    2    3    4    5    6    7    8    9
Inst 1    IF   ID   EX   MEM  WB
Inst 2         IF   ID   EX   MEM  WB

x9 value retrieved: Inst 2’s ID stage (cycle 3)
x9 modified: Inst 1’s WB stage (cycle 5)
Forward here: Inst 1’s result at the end of EX (cycle 3) is passed to Inst 2’s EX stage (cycle 4), where x9 is actually used

Implementing the hardware for this is quite complicated, not something we’ll discuss
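
As a sketch of where forwarded values come from in a chain of dependent instructions (standard two-path forwarding assumed; the third instruction is made up for illustration):

add x9, x8, x7      # x9 is produced at the end of EX (cycle 3)
add x10, x9, x8     # needs x9 in EX (cycle 4): forwarded from the EX/MEM pipeline register
add x11, x9, x10    # needs x9 and x10 in EX (cycle 5): x10 from EX/MEM, x9 from MEM/WB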


Data Hazards

Edge Case! For the “lw” instruction, the loaded value is only ready after the MEM stage, not the EX stage, so there is one unavoidable stall (NOP) for data hazards that start with “lw”. (A reordering trick that hides this stall is sketched after the diagram below.)

Example

lw x9, 0(x5)

add x10, x9, x8

Cycle     1    2    3    4    5    6    7    8    9
Inst 1    IF   ID   EX   MEM  WB
Inst 2         IF   ID   ID   EX   MEM  WB

x9 value retrieved: Inst 2’s ID stage
x9 modified: the loaded value is only available at the end of Inst 1’s MEM stage (cycle 4)
Forward here: from the end of MEM (cycle 4) into Inst 2’s EX stage (cycle 5), where x9 is actually used, after one unavoidable stall cycle
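
One common way to hide that unavoidable stall is to reorder an independent instruction into the gap (a sketch; it assumes the surrounding code contains an instruction, made up here, that does not depend on x9):

lw  x9, 0(x5)
add x11, x6, x7     # independent instruction fills the stall slot
add x10, x9, x8     # x9 forwarded from the end of MEM; no bubble needed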


Control Hazards

For branch + jump instructions, we don’t know if the next instruction address is PC+4 or PC+Imm until the ALU stage.

  • Execute Stage: ALU calculates PC+Imm
  • Instruction Fetch requires address to retrieve next instruction

Example

loop: beq x9, x8, end

add x10, x9, x8

Cycle     1    2    3    4    5    6    7    8    9
Inst 1    IF   ID   EX   MEM  WB
Inst 2         IF   ID   EX   MEM  WB

Next Addr. Known: at the end of Inst 1’s EX stage (cycle 3)
Is add or the instruction at end the next instruction? Inst 2 is fetched in cycle 2, before we know.


Control Hazards

Solutions

  • More stalling! Again, not desirable because more NOPs = more delay
  • Strategically reorder instructions to eliminate the hazard! (Code optimization in the compiler)
  • Branch Prediction: We guess / default to one outcome (see the sketch after this list)
    • If correct, keep executing! We’re good to go.
    • If incorrect, we have to flush / reset the pipeline, then proceed
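
For instance, under a simple “predict not taken” policy (a sketch; the exact flush mechanism depends on the datapath, and the instructions after the branch are made up for illustration):

loop: beq  x9, x8, end    # branch outcome known at the end of EX (cycle 3)
      add  x10, x9, x8    # fetched in cycle 2, assuming the branch is not taken
      sub  x11, x10, x8   # fetched in cycle 3, also speculative
end:  addi x12, x0, 1     # if the branch was actually taken, the add and sub are flushed
                          # (turned into bubbles) and fetch restarts here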

Example

loop: beq x9, x8, end

add x10, x9, x8

Cycle     1    2    3    4    5    6    7    8    9
Inst 1    IF   ID   EX   MEM  WB

Next Addr. Known: at the end of Inst 1’s EX stage (cycle 3)