1 of 56

CMSC 611: Advanced Computer Architecture

Tomasulo

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides

Some material adapted from Hennessy & Patterson / © 2003 Elsevier Science

2 of 56

Out of Order Execution

  • Compiler rescheduling
    • Static reordering based on latency
    • Can’t span basic block
    • Can only optimize for one branch direction
  • CPU rescheduling
    • CPU reads instructions in order
    • Put instructions in a queue
    • Start executing instructions out of order
      • As resources and results are available

3 of 56

Tomasulo

  • IF/ID fills queue of instructions
    • Can go past branches, allowing FP operations beyond basic block in FP queue
  • “Reservation Station” per execution unit
    • Also for register result, load, store

4 of 56

Tomasulo Organization

4

  • The reservation stations hold instructions that had been issued and are awaiting execution at a functional unit
  • All results from FP FU and loads are broadcasted on the CDB

5 of 56

Tomasulo Example

5

6 of 56

Tomasulo Example Cycle 1

6

7 of 56

Tomasulo Example Cycle 2

7

Note: Can have multiple loads outstanding

8 of 56

Tomasulo Example Cycle 3

8

  • Note: registers names are removed (“renamed”) in Reservation Stations
  • Load1 completing; what is waiting for Load1?

9 of 56

Tomasulo Example Cycle 4

9

  • Load2 completing; what is waiting for it?

10 of 56

Tomasulo Example Cycle 5

10

11 of 56

Tomasulo Example Cycle 6

11

12 of 56

Tomasulo Example Cycle 7

12

  • Add1 completing; what is waiting for it?

13 of 56

Tomasulo Example Cycle 8

13

14 of 56

Tomasulo Example Cycle 9

14

15 of 56

Tomasulo Example Cycle 10

15

  • Add2 completing; what is waiting for it?

16 of 56

Tomasulo Example Cycle 11

16

17 of 56

Tomasulo Example Cycle 12

17

  • Note: all quick instructions complete already

18 of 56

Tomasulo Example Cycle 13

18

19 of 56

Tomasulo Example Cycle 14

19

20 of 56

Tomasulo Example Cycle 15

20

  • Mult1 completing; what is waiting for it?

21 of 56

Tomasulo Example Cycle 16

21

  • Note: Just waiting for divide

22 of 56

Tomasulo Example Cycle 55

22

23 of 56

Tomasulo Example Cycle 56

23

  • Mult 2 completing; what is waiting for it?

24 of 56

Tomasulo Example Cycle 57

24

  • Again, in-order issue, out-of-order execution, completion

25 of 56

Tomasulo “Register Renaming”

  • Operands to Functional Unit from Reservation Station, not registers
    • Replaced by value or pointer to another reservation stations
    • Don’t rely on the register name, avoids WAR, WAW hazards
    • More reservation stations than registers
    • Can do optimizations compilers cannot perform
  • Results broadcast over Common Data Bus

25

26 of 56

Reservation Station Components

  • Functional Unit Reservation Station
    • Op—Operation to perform in the unit (e.g., + or –)
    • Vj, Vk—Value of Source operands
      • Store buffers has V field, result to be stored
    • Qj, Qk—Reservation stations producing source registers (value to be written)
      • Store buffers only have Qi for RS producing result
    • Busy—Indicates reservation station or FU is busy
  • Register Reservation Station
    • Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.

26

27 of 56

Three Stages of Tomasulo Algorithm

  1. Issue—get instruction from FP Op Queue
    • If reservation station free (no structural hazard), control issues instruction & sends operands (renames registers).
  2. Execution—operate on operands (EX)
    • When both operands ready then execute; if not ready, watch Common Data Bus for result
  3. Write result—finish execution (WB)
    • Write on Common Data Bus to all awaiting units; mark reservation station available

27

28 of 56

Normal vs. Common Data Bus

  • Normal data bus: data + destination
    • (“go to” bus)
  • Common data bus: data + source
    • (“come from” bus)
    • 64 bits of data + 4 bits of Functional Unit source address
    • Write if matches expected Functional Unit (produces result)
    • Broadcast: one to many

28

29 of 56

Tomasulo Drawbacks

  • Circuit complexity
  • Many associative stores (CDB) at high speed
  • Performance limited by Common Data Bus
    • Multiple CDBs → more FU logic for parallel associative stores

29

30 of 56

Tomasulo Loop Example

Loop: LD F0, 0(R1)

MULTD F4, F0, F2

SD F4, 0(R1)

SUBI R1, R1, #8

BNEZ R1, Loop

  • Assume Multiply takes 4 clocks
  • Assume first load takes 8 clocks (cache miss), second load takes 4 clocks (hit)
  • To be clear, will show clocks for SUBI, BNEZ
  • Reality, integer instructions ahead

30

31 of 56

Loop Example Cycle 0

31

32 of 56

Loop Example Cycle 1

32

33 of 56

Loop Example Cycle 2

33

34 of 56

Loop Example Cycle 3

34

  • Note: MULT1 has no registers names in RS

35 of 56

Loop Example Cycle 4

35

  • Issue SUBI

36 of 56

Loop Example Cycle 5

36

  • Issue BNEZ

37 of 56

Loop Example Cycle 6

37

  • Note: F0 never sees Load1 result

38 of 56

Loop Example Cycle 7

38

  • Note: MULT2 has no registers names in RS

39 of 56

Loop Example Cycle 8

39

40 of 56

Loop Example Cycle 9

40

  • Load1 completing; what is waiting for it?

41 of 56

Loop Example Cycle 10

41

  • Load2 completing; what is waiting for it?

42 of 56

Loop Example Cycle 11

42

43 of 56

Loop Example Cycle 12

43

44 of 56

Loop Example Cycle 13

44

45 of 56

Loop Example Cycle 14

45

  • Mult1 completing; what is waiting for it?

46 of 56

Loop Example Cycle 15

46

  • Mult2 completing; what is waiting for it?

47 of 56

Loop Example Cycle 16

47

48 of 56

Loop Example Cycle 17

48

49 of 56

Loop Example Cycle 18

49

50 of 56

Loop Example Cycle 19

50

51 of 56

Loop Example Cycle 20

51

52 of 56

Loop Example Cycle 21

52

53 of 56

Multiple Issue

  • Static: Integer/FP instruction pairs
  • Dynamic: Extend Tomasulo
    • Must handle dependencies on issue
    • Two instructions: split clock in half
    • More: Explicit dependency issue logic

53

54 of 56

Reorder Buffers

  • Most modern processors use a variation of Tomasulo
  • Add Reorder Buffer (ROB) in place of register stations & memory store
  • Write from ROB in order issued
  • Makes sure earlier instructions complete
    • Simplifies exceptions & speculation

54

55 of 56

ROB Changes

  1. ROB holds:
    • Destination (register or mem address)
    • Slot for value (when done)
  2. Other RS identified by ROB#
    • Each RS knows what ROB# it is going to
    • CDB b-cast: ROB1/value vs. Add2/value
  3. Retire phase
    • Check oldest ROB entries
    • If done, “commit” write to reg or memory

56 of 56

Speculation with ROB

  • Roll back PC
  • Discard invalid ROB entries
  • OK to leave everything else as-is
    • Stuff in execution units can run, but nobody will be listening for the result on the CDB
  • Exceptions happen at ROB retire