1 of 40

Instruction Level Parallelism and its Exploitation (Part 2)

Chapter 3

Appendix H

1

Prof. Iyad Jafar

https://sites.google.com/view/iyadjafar

iyad.jafar@ju.edu.jo

2 of 40

Outline

  • Dynamic Scheduling (3.4 & 3.5)
  • HW-Based Speculation (3.6)

2

Prof. Iyad Jafar

3 of 40

Dynamic Scheduling

3

Prof. Iyad Jafar

4 of 40

Introduction

  • Pipelining uses in-order issue and execute of instructions:
    • Instructions are issued in program order.
    • If an instruction is stalled, no later instructions can proceed.

  • Why not to start execution as soon as there is no hazard?
    • Issue instructions In-order, but the start execution as soon as the data is available.
  • This implies
    • Out-of-order execution.
    • Out-of-order completion.
    • WAR and WAW hazards may occur.

4

Stalled for no reason!

fdiv.d f0,f2,f4

fadd.d f10,f0,f8

fsub.d f12,f8,f14

Prof. Iyad Jafar

5 of 40

Dynamic Scheduling

  • Scheduling of instructions at compile time (static scheduling) is beneficial, but limited!
    • Compiler scheduling can not solve all dependencies.
    • Pipeline is stalled and no instructions are issued until the dependence is cleared.
  • Dynamic scheduling
    • Use HW to reorder instructions while maintaining data flow and exceptions.
    • Advantages:
      • Compiler doesn’t need to know the microarchitecture.
      • Handle runtime dependencies, such as cache misses, that are unknown at compile time.
    • Disadvantages:
      • HW cost.
      • Complicating exceptions.

5

Prof. Iyad Jafar

6 of 40

Idea and Challenges

  • Example 1

6

fdiv.d f0,f2,f4 fmul.d f6,f0,f8 fadd.d f0,f10,f14

  • fadd.d is not dependent on earlier instructions.
  • However, we can’t issue earlier without register renaming due WAR and WAW.
  • Example 2

fdiv.d f0,f2,f4 fadd.d f6,f0,f8

fsd f6,0(x1) fsub.d f8,f10,f14 fmul.d f6,f10,f8

  • fsub.d and fmul.d are not data dependent on previous instructions.
  • However, we can’t issue earlier (WAR and WAW) without register renaming.

??

??

Prof. Iyad Jafar

7 of 40

Idea and Challenges

  • Solving WAR and WAW can be done simply by register renaming using the compiler.

  • However, we have limited number of registers in ISA.
  • Dynamic scheduling algorithms uses additional HW with additional invisible registers for register renaming.
    • Scoreboarding (C.7) and Tomasulo

7

fdiv.d f0,f2,f4

fadd.d S,f0,f8

fsd S,0(x1)

fsub.d T,f10,f14

fmul.d f6,f10,T

fdiv.d f0,f2,f4 fadd.d f6,f0,f8

fsd f6,0(x1) fsub.d f8,f10,f14 fmul.d f6,f10,f8

Prof. Iyad Jafar

8 of 40

Tomasulo algorithm

  • Robert Tomasulo 1967, IBM.
    • Most modern high performance processors use a derivative of Tomasulo algorithm .

  • Avoids RAW by executing an instruction only when its operands are available.

  • Eliminate WAW and WAR by register renaming:
    • Renaming all destination registers including those with a pending read or write of an earlier instruction.

8

Prof. Iyad Jafar

9 of 40

Tomasulo algorithm

  • The fetch stage places the instruction into the instruction register or a queue of pending instructions.

  • Split the ID stage into:
    • Issue stage to decode instruction and check for dependence.
    • Read operands stage, wait until no data hazards then read operands.

  • Instructions are issued from the register or the queue in-order, but they could be stalled or bypassed from execution later.

  • Multiple instructions can be in execution!
    • Use of multiple functional units and/or pipelined units.

9

Prof. Iyad Jafar

10 of 40

Tomasulo algorithm

  • In Tomasulo, register renaming is provided by reservation stations
    • Buffers the operands of instruction waiting to issue instead of reading them from registers (avoids WAR).
    • Are associated with functional units (adders, multipliers, … ).
    • Pending instructions designate the reservation station that will produce their operands.
    • For successive register writes, the last write is used to update the register.
    • As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation station that will produce the results.
  • Two important properties
    • Distributed hazard detection and execution control.
    • Results are passed directly to functional units from the reservation stations or registers (Common Data bus, CDB).

10

Prof. Iyad Jafar

11 of 40

Dynamic Scheduling - Tomasulo

11

  • No control is displayed.
  • Instructions are issued FIFO (in-order).
  • RS contain the operation, the actual operands, information to detect and resolve hazards
  • The load and store buffers hold data/addresses coming from and going to memory and behave almost exactly like reservation stations.

Prof. Iyad Jafar

12 of 40

Tomasulo Basic Steps

  1. Fetch Step (F)
    • Get instructions from the instruction memory to the instruction queue.
  2. Issue Step (I)
    • Get the next instruction from the head of the instruction queue.
    • If there is no RS/Buffer available, structural hazard occurs and instruction issue is stalled.
    • If there is available RS/buffer, issue the instruction with operands available in registers (solves WAR).
    • If operand(s) is not available in registers, keep track of the functional units that will produce the operands (solves RAW).

12

Prof. Iyad Jafar

13 of 40

Dynamic Scheduling - Tomasulo

  1. Execute Step for FP ALU (E)
    • If one or more of the operands is not yet available, monitor CDB.
    • When an operand is ready, it is placed on CDB and copied to all RS waiting for it.
    • If all operands are ready, execute the instruction.
    • In case more than one FP ALU instruction in an RS are ready to execute, choose among them arbitrarily.

    • This is not the case for load and store.

13

Prof. Iyad Jafar

14 of 40

Dynamic Scheduling - Tomasulo

  1. Execute Step for Load/Store (E)
    • Loads and stores require a two-step execution process; computing address and memory access.
    • When the base register is available, compute the address and place it in load/store buffer.
    • Load accesses the memory if it is available, while Store wait for the value to be stored.
    • Loads and stores in the buffers are executed in order to prevent hazards through memory.

14

In all cases, the assumption is that no instruction is allowed to initiate execution, until all branches that precede the instruction in program order have completed!

Prof. Iyad Jafar

15 of 40

Dynamic Scheduling - Tomasulo

  1. Write Step (W)
    • When an RS produces result is, it is written on the CDB to update RS and registers waiting for the result.
    • Stores should wait in the store buffer until both the value to be stored and the address are available. The result is written as soon as the memory unit is free.

15

Prof. Iyad Jafar

16 of 40

Notes

  • Hazard detection and elimination HW is attached to registers, RS and load/store buffers (distributed hazard detection).

  • Once an instruction has been issued and is waiting for a source operand, it refers to the operand by the reservation station number where the instruction that will write the register has been assigned.

  • The Tomasulo scheme refers to the buffer or unit that will produce a result; the register names are discarded when an instruction is issued to a reservation station.

16

Prof. Iyad Jafar

17 of 40

Fields in Reservation Stations

  • Op - The operation to perform on source operands S1 and S2
  • Qj, Qk - hold the RS number that will produce the source operand
    • Zero value🡪 the source operand is already available in Vj or Vk
  • Vj, Vk - The value of the source operands
    • For loads, the Vk field is used to hold the offset field
  • A - for Load/store. Initially, the immediate field of the instruction is stored here; after the address calculation, the effective address is stored
  • Busy - Indicates that this reservation station and its accompanying functional unit are occupied.

17

Op

Qj

Qk

Vj

Vk

A

Busy

Prof. Iyad Jafar

18 of 40

Fields in Register File

  • Register file is modified such that each register has an additional field (Qi) that holds the number of RS that will produce the result to be written into this register.

  • Blank value implies that no active instruction is writing to this register and value is ready to be read.

18

Qi

Register Content

Prof. Iyad Jafar

19 of 40

Tomasulo Algorithm

  • Issue

19

Prof. Iyad Jafar

20 of 40

Tomasulo Algorithm

  • Execute

  • Write

20

Prof. Iyad Jafar

21 of 40

Example 1

  • Show what the information tables in Tomasulo look like for the following code sequence. Assume:
    • all instructions have issued but only the first load has completed and written its result.
    • assume the execute latencies in the table.

21

Operation

Execute Latency

LOAD

2

FP ADD

2

MULTIPLY

6

DIVIDE

12

fld f6,32(x2)

fld f2,44(x3)

fmul.d f0,f2,f4

fsub.d f8,f2,f6

fdiv.d f10,f0,f6

fadd.d f6,f8,f2

Prof. Iyad Jafar

22 of 40

Dynamic Scheduling - Tomasulo

  • The first load has completed and all instructions are issued.

22

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

24

25

I

A

M

W

I

A

M

W

I

I

I

I

Compare to pipelined implementation with no dynamic scheduling!

fld f6,32(x2)

fld f2,44(x3)

fmul.d f0,f2,f4

fsub.d f8,f2,f6

fdiv.d f10,f0,f6

fadd.d f6,f8,f2

Prof. Iyad Jafar

23 of 40

Example 1

  • One instruction is issued each cycle.
  • Since we have sufficient RS, we can proceed to execute.
  • Instruction status table (not part of the implementation)

23

Prof. Iyad Jafar

24 of 40

Example 1

24

Prof. Iyad Jafar

25 of 40

Example 2

  • For the code in Example 1, show the information tables in Tomasulo for the following code sequence when the MUL.D is ready to write its result

25

Operation

Execute Latency

LOAD

1

FP ADD

2

MULTIPLY

6

DIVIDE

12

fld f6,32(x2)

fld f2,44(x3)

fmul.d f0,f2,f4

fsub.d f8,f2,f6

fdiv.d f10,f0,f6

fadd.d f6,f8,f2

Prof. Iyad Jafar

26 of 40

Example 2

  • MUL.D is about to write!

26

1

2

3

4

5

6

7

8

9

10

11

12

13

14

24

25

26

I

A

M

W

I

A

M

W

I

S

S

E

E

E

E

E

E

W

I

S

E

E

W

I

S

S

S

S

S

S

S

E

E

E

W

I

S

S

E

E

W

Compare to pipelined implementation with no dynamic scheduling!

fld f6,32(x2)

fld f2,44(x3)

fmul.d f0,f2,f4

fsub.d f8,f2,f6

fdiv.d f10,f0,f6

fadd.d f6,f8,f2

Prof. Iyad Jafar

27 of 40

Example 2

27

Prof. Iyad Jafar

28 of 40

Example 3

  • Check Loop example on page 204

28

Prof. Iyad Jafar

29 of 40

Hardware-Based Speculation

29

Prof. Iyad Jafar

30 of 40

Introduction

  • Exploiting more ILP makes it harder to deal with control dependencies.
  • Dynamic scheduling:
    • Hold the issue of instructions until the preceding branches have resolved
    • Performance!?
  • Overcome control dependencies in dynamic scheduling using SW or HW speculation!
    • Fetch, issue and execute instruction assuming prediction is correct.
    • Need a mechanism to revert operations upon incorrect prediction.
  • The idea in speculation vs dynamic scheduling is that instructions are not allowed to change the registers and memory unless the speculation is correct!

30

Prof. Iyad Jafar

31 of 40

Extending Tomasulo

  • Separate bypassing results among instructions, from the actual. completion of an instruction:
    • Basically, instructions execute out-of-order but finish/commit in-order to prevent irreversible actions
  • Add a commit step which requires a hardware to:
    • Buffer the results of instructions that have finished execution but have not committed.
    • Pass results among instructions that may be speculated.

  • Reorder buffer (ROB) !

31

Prof. Iyad Jafar

32 of 40

Reorder Buffer

  • Provides additional registers to hold the result between the completion and committing of instructions, and the time to commit.
    • Since instructions are issued in order, they are assigned entries in ROB sequentially.

  • Supplies operands to other instructions; i.e. like RS, between completion and commit.

  • The result is not written from ROB to register file or memory until the instruction commits.

32

Prof. Iyad Jafar

33 of 40

Reorder Buffer

  • Each ROB entry has:
    • Instruction type (branch, register, store).
    • Destination field 🡪 supplies register number for Load and ALU, memory address for Store.
    • Value field 🡪 holds the value of the instruction result until it commits.
    • Ready field 🡪 indicates whether instruction has finished execution and value is ready.
  • Since instructions are issued in order, the assigned entries in ROB is sequential.
  • Renaming is provided by ROB
    • Results are tagged using ROB entries.
    • RS still buffer the operands.

33

Prof. Iyad Jafar

34 of 40

HW-Based Speculation

34

Prof. Iyad Jafar

35 of 40

Execution Steps

  • Fetch (F)
  • Issue (I)
    • If RS and ROB is not available, stall.
    • Otherwise, issue instruction and send operands from registers or ROB.
    • The number of assigned ROB entry is sent to RS.
  • Execute (E)
    • If one or more operands is not available, monitor the CDB for the register to be computed.
    • Execute when all operands are available.
  • Write (W)
    • When the result is available, write it on CDB with ROB tag assigned to instruction.
    • The result is stored in designated ROB entry(s) and any RS waiting the result.

35

Prof. Iyad Jafar

36 of 40

Execution Steps

  • Commit (C) – final stage and has three different cases depending on the instruction at head of ROB
    • Normal commit
      • When ALU or load instructions reaches the head of ROB and its result is available, the result is written to register file and the instruction is removed from ROB.
    • Store commit
      • similar to normal commit, except that the memory is updated.
    • Branch commit
      • if the instruction at head of ROB is branch with incorrect perdition, the ROB is flushed and execution is restarted from the correct address.
      • In case the prediction correct, the branch is finished.

  • Check the algorithm in Figure 3.18, p. 216

36

Prof. Iyad Jafar

37 of 40

Example 4

  • Show how the status tables look like when the MUL.D is ready to go to commit. Assume same latencies for FP units as in previous examples.

37

fld f6,32(x2)

fld f2,44(x3)

fmul.d f0,f2,f4

fsub.d f8,f2,f6

fdiv.d f10,f0,f6

fadd.d f6,f8,f2

Prof. Iyad Jafar

38 of 40

Example 4

  • MUL.D is ready to commit, i.e. it reached the head of ROB!
  • Note that committing instructions is in-order

    • In contrast to Example 2 (dynamic scheduling), the fsub.d and fadd.d are not allowed to complete/commit (write to registers) before the mul.d has finished.

38

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

24

25

26

27

F

I

A

M

W

C

F

I

A

M

W

C

F

I

S

S

E

E

E

E

E

E

W

C

F

I

S

E

E

W

-

-

-

-

-

C

F

I

S

S

S

S

S

S

S

E

E

E

E

W

C

F

I

S

S

E

E

W

-

-

-

-

-

-

-

-

C

Compare to pipelined implementation with no dynamic scheduling!

fld f6,32(x2)

fld f2,44(x3)

fmul.d f0,f2,f4

fsub.d f8,f2,f6

fdiv.d f10,f0,f6

fadd.d f6,f8,f2

Prof. Iyad Jafar

39 of 40

Example 4

39

head

x

x

Prof. Iyad Jafar

40 of 40

HW-Based Speculation

  • Notes – comparison Example 2 (dynamic scheduling)
    • No instruction after the earliest uncompleted instruction (MUL.D) is allowed to complete or commit.
    • This allows for precise exception treatment. Instructions following the offending instruction are flushed.
    • This is not applicable in dynamic scheduling since earlier instruction can update the state before it is known that a following instruction causes an exception.

  • Check loop example on p. 214.

40

Prof. Iyad Jafar