JavaScript isn't enabled in your browser, so this file can't be opened. Enable and reload.

1 of 40

Instruction Level Parallelism and its Exploitation (Part 2)

Chapter 3

Appendix H

Prof. Iyad Jafar

https://sites.google.com/view/iyadjafar

iyad.jafar@ju.edu.jo

2 of 40

Outline

Dynamic Scheduling (3.4 & 3.5)
HW-Based Speculation (3.6)

Prof. Iyad Jafar

3 of 40

Dynamic Scheduling

Prof. Iyad Jafar

4 of 40

Introduction

Pipelining uses in-order issue and execute of instructions:

Instructions are issued in program order.
If an instruction is stalled, no later instructions can proceed.

Why not to start execution as soon as there is no hazard?

Issue instructions In-order, but the start execution as soon as the data is available.

This implies

Out-of-order execution.
Out-of-order completion.
WAR and WAW hazards may occur.

Stalled for no reason!

fdiv.d f0,f2,f4

fadd.d f10,f0,f8

fsub.d f12,f8,f14

Prof. Iyad Jafar

5 of 40

Dynamic Scheduling

Scheduling of instructions at compile time (static scheduling) is beneficial, but limited!

Compiler scheduling can not solve all dependencies.
Pipeline is stalled and no instructions are issued until the dependence is cleared.

Dynamic scheduling

Use HW to reorder instructions while maintaining data flow and exceptions.
Advantages:

Compiler doesn’t need to know the microarchitecture.
Handle runtime dependencies, such as cache misses, that are unknown at compile time.

Disadvantages:

HW cost.
Complicating exceptions.

Prof. Iyad Jafar

6 of 40

Idea and Challenges

Example 1

fdiv.d f0,f2,f4 fmul.d f6,f0,f8 fadd.d f0,f10,f14

fadd.d is not dependent on earlier instructions.
However, we can’t issue earlier without register renaming due WAR and WAW.

Example 2

fdiv.d f0,f2,f4 fadd.d f6,f0,f8

fsd f6,0(x1) fsub.d f8,f10,f14 fmul.d f6,f10,f8

fsub.d and fmul.d are not data dependent on previous instructions.
However, we can’t issue earlier (WAR and WAW) without register renaming.

Prof. Iyad Jafar

7 of 40

Idea and Challenges

Solving WAR and WAW can be done simply by register renaming using the compiler.

However, we have limited number of registers in ISA.
Dynamic scheduling algorithms uses additional HW with additional invisible registers for register renaming.

Scoreboarding (C.7) and Tomasulo

fdiv.d f0,f2,f4

fadd.d S,f0,f8

fsd S,0(x1)

fsub.d T,f10,f14

fmul.d f6,f10,T

fdiv.d f0,f2,f4 fadd.d f6,f0,f8

fsd f6,0(x1) fsub.d f8,f10,f14 fmul.d f6,f10,f8

Prof. Iyad Jafar

8 of 40

Tomasulo algorithm

Robert Tomasulo 1967, IBM.

Most modern high performance processors use a derivative of Tomasulo algorithm .

Avoids RAW by executing an instruction only when its operands are available.

Eliminate WAW and WAR by register renaming:

Renaming all destination registers including those with a pending read or write of an earlier instruction.

Prof. Iyad Jafar

9 of 40

Tomasulo algorithm

The fetch stage places the instruction into the instruction register or a queue of pending instructions.

Split the ID stage into:

Issue stage to decode instruction and check for dependence.
Read operands stage, wait until no data hazards then read operands.

Instructions are issued from the register or the queue in-order, but they could be stalled or bypassed from execution later.

Multiple instructions can be in execution!

Use of multiple functional units and/or pipelined units.

Prof. Iyad Jafar

10 of 40

Tomasulo algorithm

In Tomasulo, register renaming is provided by reservation stations

Buffers the operands of instruction waiting to issue instead of reading them from registers (avoids WAR).
Are associated with functional units (adders, multipliers, … ).
Pending instructions designate the reservation station that will produce their operands.
For successive register writes, the last write is used to update the register.
As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation station that will produce the results.

Two important properties

Distributed hazard detection and execution control.
Results are passed directly to functional units from the reservation stations or registers (Common Data bus, CDB).

Prof. Iyad Jafar

11 of 40

Dynamic Scheduling - Tomasulo

No control is displayed.
Instructions are issued FIFO (in-order).
RS contain the operation, the actual operands, information to detect and resolve hazards
The load and store buffers hold data/addresses coming from and going to memory and behave almost exactly like reservation stations.

Prof. Iyad Jafar

12 of 40

Tomasulo Basic Steps

Fetch Step (F)

Get instructions from the instruction memory to the instruction queue.

Issue Step (I)

Get the next instruction from the head of the instruction queue.
If there is no RS/Buffer available, structural hazard occurs and instruction issue is stalled.
If there is available RS/buffer, issue the instruction with operands available in registers (solves WAR).
If operand(s) is not available in registers, keep track of the functional units that will produce the operands (solves RAW).

Prof. Iyad Jafar

13 of 40

Dynamic Scheduling - Tomasulo

Execute Step for FP ALU (E)

If one or more of the operands is not yet available, monitor CDB.
When an operand is ready, it is placed on CDB and copied to all RS waiting for it.
If all operands are ready, execute the instruction.
In case more than one FP ALU instruction in an RS are ready to execute, choose among them arbitrarily.

This is not the case for load and store.

Prof. Iyad Jafar

14 of 40

Dynamic Scheduling - Tomasulo

Execute Step for Load/Store (E)

Loads and stores require a two-step execution process; computing address and memory access.
When the base register is available, compute the address and place it in load/store buffer.
Load accesses the memory if it is available, while Store wait for the value to be stored.
Loads and stores in the buffers are executed in order to prevent hazards through memory.

In all cases, the assumption is that no instruction is allowed to initiate execution, until all branches that precede the instruction in program order have completed!

Prof. Iyad Jafar

15 of 40

Dynamic Scheduling - Tomasulo

Write Step (W)

When an RS produces result is, it is written on the CDB to update RS and registers waiting for the result.
Stores should wait in the store buffer until both the value to be stored and the address are available. The result is written as soon as the memory unit is free.

Prof. Iyad Jafar

16 of 40

Notes

Hazard detection and elimination HW is attached to registers, RS and load/store buffers (distributed hazard detection).

Once an instruction has been issued and is waiting for a source operand, it refers to the operand by the reservation station number where the instruction that will write the register has been assigned.

The Tomasulo scheme refers to the buffer or unit that will produce a result; the register names are discarded when an instruction is issued to a reservation station.

Prof. Iyad Jafar

17 of 40

Fields in Reservation Stations

Op - The operation to perform on source operands S1 and S2
Qj, Qk - hold the RS number that will produce the source operand

Zero value🡪 the source operand is already available in Vj or Vk

Vj, Vk - The value of the source operands

For loads, the Vk field is used to hold the offset field

A - for Load/store. Initially, the immediate field of the instruction is stored here; after the address calculation, the effective address is stored
Busy - Indicates that this reservation station and its accompanying functional unit are occupied.

Op	Qj	Qk	Vj	Vk	A	Busy

Prof. Iyad Jafar

18 of 40

Fields in Register File

Register file is modified such that each register has an additional field (Qi) that holds the number of RS that will produce the result to be written into this register.

Blank value implies that no active instruction is writing to this register and value is ready to be read.

Qi	Register Content
	…
	…

Prof. Iyad Jafar

19 of 40

Tomasulo Algorithm

Issue

Prof. Iyad Jafar

20 of 40

Tomasulo Algorithm

Execute

Write

Prof. Iyad Jafar

21 of 40

Example 1

Show what the information tables in Tomasulo look like for the following code sequence. Assume:

all instructions have issued but only the first load has completed and written its result.
assume the execute latencies in the table.

Operation	Execute Latency
LOAD	2
FP ADD	2
MULTIPLY	6
DIVIDE	12

fld f6,32(x2)

fld f2,44(x3)

fmul.d f0,f2,f4

fsub.d f8,f2,f6

fdiv.d f10,f0,f6

fadd.d f6,f8,f2

Prof. Iyad Jafar

22 of 40

Dynamic Scheduling - Tomasulo

The first load has completed and all instructions are issued.

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	…	24	25
I	A	M	W
	I	A	M	W
		I
			I
				I
					I

Compare to pipelined implementation with no dynamic scheduling!

fld f6,32(x2)

fld f2,44(x3)

fmul.d f0,f2,f4

fsub.d f8,f2,f6

fdiv.d f10,f0,f6

fadd.d f6,f8,f2

Prof. Iyad Jafar

23 of 40

Example 1

One instruction is issued each cycle.
Since we have sufficient RS, we can proceed to execute.
Instruction status table (not part of the implementation)

Prof. Iyad Jafar

24 of 40

Example 1

Prof. Iyad Jafar

25 of 40

Example 2

For the code in Example 1, show the information tables in Tomasulo for the following code sequence when the MUL.D is ready to write its result

Operation	Execute Latency
LOAD	1
FP ADD	2
MULTIPLY	6
DIVIDE	12

fld f6,32(x2)

fld f2,44(x3)

fmul.d f0,f2,f4

fsub.d f8,f2,f6

fdiv.d f10,f0,f6

fadd.d f6,f8,f2

Prof. Iyad Jafar

26 of 40

Example 2

MUL.D is about to write!

1	2	3	4	5	6	7	8	9	10	11	12	13	14	…	24	25	26
I	A	M	W
	I	A	M	W
		I	S	S	E	E	E	E	E	E	W
			I	S	E	E	W
				I	S	S	S	S	S	S	S	E	…	E	E	W
					I	S	S	E	E	W

Compare to pipelined implementation with no dynamic scheduling!

fld f6,32(x2)

fld f2,44(x3)

fmul.d f0,f2,f4

fsub.d f8,f2,f6

fdiv.d f10,f0,f6

fadd.d f6,f8,f2

Prof. Iyad Jafar

27 of 40

Example 2

Prof. Iyad Jafar

28 of 40

Example 3

Check Loop example on page 204

Prof. Iyad Jafar

29 of 40

Hardware-Based Speculation

Prof. Iyad Jafar

30 of 40

Introduction

Exploiting more ILP makes it harder to deal with control dependencies.
Dynamic scheduling:

Hold the issue of instructions until the preceding branches have resolved
Performance!?

Overcome control dependencies in dynamic scheduling using SW or HW speculation!

Fetch, issue and execute instruction assuming prediction is correct.
Need a mechanism to revert operations upon incorrect prediction.

The idea in speculation vs dynamic scheduling is that instructions are not allowed to change the registers and memory unless the speculation is correct!

Prof. Iyad Jafar

31 of 40

Extending Tomasulo

Separate bypassing results among instructions, from the actual. completion of an instruction:

Basically, instructions execute out-of-order but finish/commit in-order to prevent irreversible actions

Add a commit step which requires a hardware to:

Buffer the results of instructions that have finished execution but have not committed.
Pass results among instructions that may be speculated.

Reorder buffer (ROB) !

Prof. Iyad Jafar

32 of 40

Reorder Buffer

Provides additional registers to hold the result between the completion and committing of instructions, and the time to commit.

Since instructions are issued in order, they are assigned entries in ROB sequentially.

Supplies operands to other instructions; i.e. like RS, between completion and commit.

The result is not written from ROB to register file or memory until the instruction commits.

Prof. Iyad Jafar

33 of 40

Reorder Buffer

Each ROB entry has:

Instruction type (branch, register, store).
Destination field 🡪 supplies register number for Load and ALU, memory address for Store.
Value field 🡪 holds the value of the instruction result until it commits.
Ready field 🡪 indicates whether instruction has finished execution and value is ready.

Since instructions are issued in order, the assigned entries in ROB is sequential.
Renaming is provided by ROB

Results are tagged using ROB entries.
RS still buffer the operands.

Prof. Iyad Jafar

34 of 40

HW-Based Speculation

Prof. Iyad Jafar

35 of 40

Execution Steps

Fetch (F)
Issue (I)

If RS and ROB is not available, stall.
Otherwise, issue instruction and send operands from registers or ROB.
The number of assigned ROB entry is sent to RS.

Execute (E)

If one or more operands is not available, monitor the CDB for the register to be computed.
Execute when all operands are available.

Write (W)

When the result is available, write it on CDB with ROB tag assigned to instruction.
The result is stored in designated ROB entry(s) and any RS waiting the result.

Prof. Iyad Jafar

36 of 40

Execution Steps

Commit (C) – final stage and has three different cases depending on the instruction at head of ROB

Normal commit

When ALU or load instructions reaches the head of ROB and its result is available, the result is written to register file and the instruction is removed from ROB.

Store commit

similar to normal commit, except that the memory is updated.

Branch commit

if the instruction at head of ROB is branch with incorrect perdition, the ROB is flushed and execution is restarted from the correct address.
In case the prediction correct, the branch is finished.

Check the algorithm in Figure 3.18, p. 216

Prof. Iyad Jafar

37 of 40

Example 4

Show how the status tables look like when the MUL.D is ready to go to commit. Assume same latencies for FP units as in previous examples.

fld f6,32(x2)

fld f2,44(x3)

fmul.d f0,f2,f4

fsub.d f8,f2,f6

fdiv.d f10,f0,f6

fadd.d f6,f8,f2

Prof. Iyad Jafar

38 of 40

Example 4

MUL.D is ready to commit, i.e. it reached the head of ROB!
Note that committing instructions is in-order

In contrast to Example 2 (dynamic scheduling), the fsub.d and fadd.d are not allowed to complete/commit (write to registers) before the mul.d has finished.

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	…	24	25	26	27
F	I	A	M	W	C
	F	I	A	M	W	C
		F	I	S	S	E	E	E	E	E	E	W	C
			F	I	S	E	E	W	-	-	-	-	-	C
				F	I	S	S	S	S	S	S	S	E	E	…	E	E	W	C
					F	I	S	S	E	E	W	-	-	-	-	-	-	-	-	C

Compare to pipelined implementation with no dynamic scheduling!

fld f6,32(x2)

fld f2,44(x3)

fmul.d f0,f2,f4

fsub.d f8,f2,f6

fdiv.d f10,f0,f6

fadd.d f6,f8,f2

Prof. Iyad Jafar

39 of 40

Example 4

head

Prof. Iyad Jafar

At the time the MUL.D is ready to commit

only the two L.D instructions have committed, although several others have completed execution.
The MUL.D is at the head of the ROB, and the two L.D instructions are there only to ease understanding.
The SUB.D and ADD.D instructions will not commit until the MUL.D instruction commits, although the results of the instructions are available and can be used as sources for other instructions.
The DIV.D is in execution, but has not completed solely due to its longer latency than MUL.D.
The Value column indicates the value being held; the format #X is used to refer to a value field of ROB entry X.
Reorder buffers 1 and 2 are actually completed but are shown for informational purposes.
We do not show the entries for the load/store queue, but these entries are kept in order.

40 of 40

HW-Based Speculation

Notes – comparison Example 2 (dynamic scheduling)

No instruction after the earliest uncompleted instruction (MUL.D) is allowed to complete or commit.
This allows for precise exception treatment. Instructions following the offending instruction are flushed.
This is not applicable in dynamic scheduling since earlier instruction can update the state before it is known that a following instruction causes an exception.

Check loop example on p. 214.

Prof. Iyad Jafar

1	2	3	4	5	6	7	8	9	10	11	12	13	14	…	24	25	26
I	A	M	W
	I	A	M	W
		I	S	S	E	E	E	E	E	E	W
			I	S	E	E	W
				I	S	S	S	S	S	S	S	E	…	E	E	W
					I	S	S	E	E	W

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	…	24	25	26	27
F	I	A	M	W	C
	F	I	A	M	W	C
		F	I	S	S	E	E	E	E	E	E	W	C
			F	I	S	E	E	W	-	-	-	-	-	C
				F	I	S	S	S	S	S	S	S	E	E	…	E	E	W	C
					F	I	S	S	E	E	W	-	-	-	-	-	-	-	-	C

1	2	3	4	5	6	7	8	9	10	11	12	13	14	…	24	25	26
I	A	M	W
	I	A	M	W
		I	S	S	E	E	E	E	E	E	W
			I	S	E	E	W
				I	S	S	S	S	S	S	S	E	…	E	E	W
					I	S	S	E	E	W

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	…	24	25	26	27
F	I	A	M	W	C
	F	I	A	M	W	C
		F	I	S	S	E	E	E	E	E	E	W	C
			F	I	S	E	E	W	-	-	-	-	-	C
				F	I	S	S	S	S	S	S	S	E	E	…	E	E	W	C
					F	I	S	S	E	E	W	-	-	-	-	-	-	-	-	C

1	2	3	4	5	6	7	8	9	10	11	12	13	14	…	24	25	26
I	A	M	W
	I	A	M	W
		I	S	S	E	E	E	E	E	E	W
			I	S	E	E	W
				I	S	S	S	S	S	S	S	E	…	E	E	W
					I	S	S	E	E	W

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	…	24	25	26	27
F	I	A	M	W	C
	F	I	A	M	W	C
		F	I	S	S	E	E	E	E	E	E	W	C
			F	I	S	E	E	W	-	-	-	-	-	C
				F	I	S	S	S	S	S	S	S	E	E	…	E	E	W	C
					F	I	S	S	E	E	W	-	-	-	-	-	-	-	-	C