EE360N: Computer Architecture Lecture #19
Department of Electrical and Computer Engineering
The University of Texas at Austin 12/5/2008
Disclaimer: The contents of this document are scribe notes for The University of Texas at Austin EE360N Fall 2008, Computer Architecture.* The notes capture the class discussion and may contain erroneous and unverified information and comments.
In-Order Retirement and Branch Prediction
Lecture #19: 11/05/2008
Lecturer: Prof. Derek Chiou
Scribes: Richard Llaca Jr., Ramapriyan Chakravarthy
Recap and Outline
In the previous lectures, we discussed how dependences in the pipeline can be resolved by using scoreboarding and Tomasulo's Algorithm. Today's lecture will focus on (i) understanding the differences between scoreboarding and Tomasulo's algorithm and (ii) out-of-order execution and in-order completion.
Scoreboarding vs. Tomasulo's Algorithm.
Consider a sequence of instructions given below. We shall now see how scoreboarding and Tomasulo's will execute the above set of instructions and compare their performance. We make the following assumptions about the architecture common to both the techniques:
The multiply unit takes 6 cycles to execute.
The adder unit takes 5 cycles to execute.
The Functional Units are non-pipelined.
A dependent instruction can be started during the write-back stage of the previous instruction because of forwarding.
Instruction 1: R3 = R1 * R2
Instruction 2: R5 = R3 + R4
Instruction 3: R7 = R2 + R6
Instruction 4: R10 = R8 + R9
Instruction 5: R11 = R7 * R10
Instruction 6: R5 = R5 + R11
Figure 1: Completed ScoreBoarding Example
The state of the pipeline and functional units after executing the instructions using score-boarding technique is as given above. We assume that the number of cycles for execution in a FU includes the cycles required for reading the operand registers. What happens when each instruction executes is given below, in program order.
It executes without stalls and produces output at cycle 8.
There is a potential RAW hazard due to R3. But, since destination register (R5) is independent and the Adder is free, this instruction is dispatched to the Adder FU but because of the RAW hazard, the instruction waits in the Adder till Instruction 1 finishes execution. During cycle 8, Instruction 1 writes back the result and due to forwarding, Instruction 2 reads R3 value and starts execution.
This instruction also needs the Adder, but since the Adder is already occupied, this instruction has to wait in decode stage till the Adder finishes Instruction 2. It completes execution at cycle 16.
This instruction is stalled at Fetch till Instruction 3 is dispatched/issued to the Adder. Since this instruction also needs the Adder, it can only dispatch/issue after Instruction 3 completes its execution and is in the process of writing back.
This instruction is fetched at cycle 12 and goes to decode at cycle 15 because of Instruction 4 is stalled in Decode. It is dispatched to the Multiplier at cycle 17. Because there is a potential RAW hazard due to R10, it waits for 3 cycles in the multiply FU without executing.
Again, there is a potential RAW hazard due to R11. This instruction waits in decode for 2 cycles since the Adder is being used by Instruction 4. Because of R11 dependency, it waits in the Adder for an additional 6 cycles. Once R11 is available it executes in 4 cycles and produces output at cycle 30.
Figure 2: Completed Tomasulo Example
The state of the pipeline and functional units after executing the instructions using Tomasulo's algorithm technique is show in Figure 2. We assume that an instruction can bypass the reservation station and go directly to an FU when the FU is free. Also, we assume that there is only one reservation station for each Functional Unit. What happens when each instruction executes is given below, in program order.
This instruction behaves same as in scoreboarding.
There is a potential RAW hazard. So this instruction is dispatched but waits in the Reservation Station till the first instruction finishes. Instead of blocking and waiting for the operand in the adder as in scoreboarding, it waits in the reservation station. While it is waiting in the reservation station, the next instruction is dispatched/issued and occupies the Adder. So Instruction 2 has to wait till Instruction 3 finishes and frees up the Adder. This is a case where the execution time of an instruction depends on a later instruction.
This instruction needs the Adder and is independent of the previous 2 instructions. Because all of the arguments are available, the FU is available and there are sufficient rename tags, this instruction bypasses the reservation station and goes directly to the Adder. At cycle 8, this instruction is ready to write-back, but so is Instruction 1 (structural hazard). In this implementation, we assume that Instruction 1 gets preference since it is older. So the completed Instruction 3 waits for another cycle in the Adder before write back. But an important point to consider at this point is whether it is better to let the older instruction take precedence when we do write back. It might be better to let the instruction which has more instructions dependent on it to go first rather than the older instruction since that instruction will enable more instructions to go to execute stage and this policy might potentially give better performance than the earliest-instruction-first policy. However, doing so may increase the complexity of the scheduler to compute the number of dependencies.
This instruction has no data dependency, but when it is ready to execute, the Adder is not free. So it waits till Adder becomes free to execute.
There is a RAW hazard because of R10. So this instruction waits in RS till Instruction 4 finishes execution.
This instruction depends on Instruction 5. So this waits in the RS of the Adder until Instruction 4 completes. The output is available at cycle 27.
Thus, from the above example, we see that Tomasulo's performs better than scoreboarding by 4 cycles.
In general, it can be said that Tomasulo's algorithm performs better than scoreboarding because (i) it doesn't block instructions which are ready by using reservation stations, and (ii) it allows WAR and WAW hazards to proceed by register renaming. For example, an instruction stream that has a set of instructions which all write to the same destination registers is much better suited for Tomasulo than scoreboarding.
Out-of-order Execution/In-Order Completion
The problem with Scoreboarding and Tomasulo's algorithm is that they allow instructions to complete out-of-order, i.e., newer instructions can complete before older instructions. Does this matter? During a normal program run it is ok for instructions to complete out of order as long as the data dependences are maintained because we will still get the right output from the program. Program order does, however, matter when there are interrupts and exceptions. Consider a simple case where there is a page fault exception during a particular memory access instruction. Typically, this would cause the PC to jump to the exception handler that would then map the required page to physical memory and then finally, return to the main program to re-execute the exception causing instruction. If we allowed out-of-order completion, the instructions after this particular instruction might have already completed and updated the registers. Now, after returning from the exception, those instructions would be executed again, potentially resulting in incorrect program behavior. Therefore we need to worry about the order in which instructions complete.
The ideal solution to the problem of Out-of-order Execution is In-Order Completion. This means that we allow the program to execute in any order but complete the instructions in program order. This means that we don't allow a particular instruction which finishes execution to save/update the state of the machine into permanent state until we are sure that it's the oldest instruction. If it’s not the oldest instruction and it has finished execution, that means that it has executed out-of-order, so we need to wait till earlier instructions complete before making changes to the processor state. This gives us the capability to 'roll-back' instructions because, when exceptions occur, we don't need to worry about instructions which finished earlier since they didn't modify the processor state at all.
In-Order Completion requires us to provide the appearance that instructions write-back in dispatch order (since Fetch, Decode and Dispatch occur in-order) rather than the order in which instructions finish execution. This means that we need to update both memory and architectural registers in program order. Another implication is that there must be a way to use non-retired state. Tomasulo's method already allows this in the sense that, when an instruction completes, it provides the output value to whoever needs it through the CDB. So, this recently completed instruction, which is not necessarily the oldest, has produced an output which is in a non-retired state until it updates the architectural register. But Tomasulo's doesn't allow us to un-complete (or rollback) a completed instruction. For example, consider a sequence of 2 instructions where the destination register for both is R7. The first of those instructions never writes back the result anywhere because of register renaming. But if we want an interrupt at the end of that first instruction, then we wouldn't be able to do it in the Tomasulo architecture that we've seen so far.
To summarize, we need 2 things to enable in-order retirement in an architecture which allows out-of-order execution:
The ability to track the oldest instruction and each uncompleted write to a register.
The ability to roll back to a specific un-completed instruction.
Re-Order Buffer (ROB)
The basic structure that's traditionally used to achieve Out-of-Order completion is called the re-order buffer or the completion buffer (invented by Prof. Patt). The ROB is basically a circular data structure (FIFO with head (oldest instruction) and tail (newest instruction) pointers) which tracks all outstanding instructions in issue order. Whenever an instruction is dispatched, an entry is added into the ROB at the tail so that the order in which instructions need to be retired is given by the order in which the instructions have been enqueued into the ROB. Whenever an instruction finishes, it only updates its corresponding entry in the ROB instead of directly updating the processor state. The instruction at the head of the ROB is the oldest instruction (by definition of FIFO) and so instructions are retired by reading out the entry at the head of the ROB and using that data to update the required architectural state. This way, whenever we need to roll back to a particular instruction, we only need to ignore all the entries in ROB after that instruction. Thus the ROB entries are like bread crumbs to determine the state of each instruction and it is similar to scoreboarding in the sense that it is the centralized data structure which determines the order of instruction retirement.
In-Order State Update
Architectural Register File Rename Register File
Figure 3: Separate Architectural Register File and Rename Register File
A possible strategy to implement in-order completion uses a separate architectural register file containing data, a tag, and committed bit, and a rename register file, indexed by rename tags and containing all pre-committed values. If the committed bit is set then the value in the architectural register file can be used; if not, then the rename register file is checked to see if the associated value is valid or not. If it is valid, that value is used, otherwise the tag is used.
For example, R3’s value has not been committed, so the tag field is consulted. The tag field contains tag A. Looking to the other table we see the value at entry A is not valid. This means A is somewhere in the machine being computed, so the value is not available until the instruction finishes and the data is written back to the rename register file. Thus, the tag A is used instead of a value.
For R0, the committed bit is true which tells us that the value can be read.
For R2, the value is uncommitted in the architectural register file, but the uncommitted value can be read from the rename register file at location B since it is valid there.
The number of rename registers required is the same number as the number of tags. From a functionality perspective only one is required, but that would limit the number of renamed instructions that have not yet completed to one. In general, the number of tags is the number of such instructions.
There are lots of ways to implement a ROB. One particular example is shown in Figure 4. Each table entry has a valid bit, an issued bit to see if it has been issued, a done bit to signal if the instruction is finished, the instruction’s address, the architectural register, and the rename register. An instruction can be retired when it is the oldest instruction and the oldest instruction is done. However, at the current state in this example, not only has the oldest instruction not finished, it has not yet been issued. One can also tell that this instruction is a branch because it doesn’t list any architectural or rename registers being used and the next instruction is at a non-incremental address.
When an instruction is completed, its architectural state updates are performed. The rename register file value is read and written to the appropriate architectural register file entry. The basic ROB data is written at Decode/Dispatch. The dispatcher writes the known data into the ROB. At Decode/Dispatch the valid bit, instruction address, architectural register, and rename register are known and are written. As the instruction goes to the ALU, then some logic will set the issued bit, and then as the instruction completes it will come back and set the done it.
Figure 5: Combined ROB and Value
A variation is shown in . The previous method had an architectural register, a rename register, and an ROB. Well now we can combine the ROB and the rename register file by putting the computed value in the ROB and eliminating the rename register file. The ROB will hold the new value computed by the finished instruction and the tag that we will place in the architectural register file is just the ROB entry number. The ROB also holds the name of the destination architectural register that corresponds to the computed value. Our architectural register file still has the same information, the value of the register, the tag, and a valid/completed bit.
Alternative: Explicit Register Renaming
In this alternative, there are only physical registers and no architectural registers. In such cases, a physical register file larger than the number of architectural registers is provided along with an indirection table, called the register alias table or RAT that maps architectural register names to the actual physical register. So every time an architectural register is read, the look up table is read to give the address of the physical register and then the physical register is read. This means that we cannot reclaim the physical register when an instruction completes because physical registers, unlike tags, sometimes hold the value of an architectural register, and are not just for renaming.
Figure 6: Explicit Register Renaming
In Figure 6, R3 points to P2 which is the physical register. If R3 is never the destination register again, then R3 will always point to P2. However, we know that at least one instruction after the one where R3 is mapped to P2 has been renamed because the register is marked Uncommitted (the U) indicating that the uncommitted column in the register alias table points to the current, uncommitted physical register that R3 maps to. The valid bit indicates whether the uncommitted physical register has been computed and written to the physical register file. Thus, if the register is marked Committed, use the value in the physical register pointed to by the Committed column (there is no such thing as a Committed but invalid register since a Committed instruction must have completed and written back its value.) If the register is marked Uncommitted, it can either be valid or invalid depending on whether the uncommitted register has been written back to.
Thus, the latest, uncommitted value for R3 is in P6, but the committed value of R3 is in P2. If another instruction uses R3 as a source, then the value in the P6 entry of the physical register file is used. If another instruction uses R3 as a destination, then another physical register would be allocated to it, overwriting the P6 in the RAT and setting the valid bit to be false, since the value hasn’t been written back to the register file yet.
Reorder Buffer with Explicit Renaming
One possible reorder buffer for the explicit renaming scheme looks like the following figure:
Figure 7: ROB with Explicit Register Renaming
The ROB contains a destination register field, containing the name of the destination architectural register, and a physical register field, that contains the name of the physical register that contains or will contain that instruction’s value of the destination register. When an instruction is retired, the physical register field entry value is copied to the RAT committed entry for the destination architectural register. Data is not copied, only the pointer is updated. The Committed Boolean field in the RAT, however, is not set to Committed. The Committed field indicates that the Uncommitted column does not contain a valid value that is only true after a rollback to Committed state.
This algorithm ensures that, when exceptions occur, the committed entries of the indirection table are correct. In other words, the committed field of the table is updated when retiring instructions by looking at ROB and the uncommitted field of the table is updated in the decode/dispatch stage of the pipeline by looking at the destination register field of the instruction.
But notice that whenever a particular committed entry is updated, the physical register that was overwritten can be reused/deallocated. Naively doing so requires an extra port on the RAT so that the original value can be read before it is overwritten. An alternative is that the structure of the ROB could be augmented to include the old physical register that was in the uncommitted field of the RAT (the immediate preceding version of that architectural register that will be overwritten when this instruction commits) when this instruction was renamed.
Figure 8: Alternative Technique of ROB using Explicit Register Renaming
Using this organization of the ROB, the RAT does not need to be read to reclaim the overwritten register. Instead, the value in the "Old Physical Reg" entry of the ROB is the physical register that can be reclaimed. The trade-off for reducing the number of read ports in the indirection table is that extra memory space is required in the ROB.
* Copyright 2008 Ho Cheung, Gregory Posey, and Derek Chiou, all rights reserved. This work may be reproduced and redistributed, in whole or in part, without prior written permission, provided all copies cite the original source of the document including the names of the copyright holders and “The University of Texas at Austin EE360N Fall 2008, Computer Architecture”.