CS-773 Paper Presentation
Improving the Utilization of Micro-operation Caches in x86 Processors

Sumon Nath
Hyperthreads (#6)
sumon@cse.iitb.ac.in
Some of the figures are adapted and modified from the paper

Coming up

  • Background - the micro-op cache
  • Impact of the micro-op cache
  • Motivation - fragmentation
  • Proposed solutions
  • Conclusion

Background: CPU frontend

  • A range of addresses ~ a basic block
  • Variable-length ISA -> power-hungry decoder
  • Fixed-length micro-ops (uops) -> simpler execution logic
  • Uops are cached in the uop cache
  • Uop cache hit -> bypass fetch & decode

Prediction window (PW)

  • A PW can start and end anywhere in an I-cache line
  • Termination conditions:

(1) I-cache line boundary (2) Predicted taken branch
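The two termination rules can be sketched as a toy model. Everything here is an illustrative assumption (instruction tuples, the 64B line size), not the paper's implementation:

```python
# Toy model of prediction-window (PW) formation: a PW ends at an
# I-cache line boundary or at a predicted-taken branch.
# Assumption: 64B I-cache lines.
LINE_SIZE = 64

def split_into_pws(instrs, line_size=LINE_SIZE):
    """instrs: list of (address, length, predicted_taken) tuples,
    assumed sequential. Returns a list of PWs (lists of tuples)."""
    pws, current = [], []
    for addr, length, taken in instrs:
        current.append((addr, length, taken))
        # Instruction ends exactly at, or straddles, a line boundary.
        crosses_boundary = ((addr + length) % line_size == 0 or
                            addr // line_size != (addr + length - 1) // line_size)
        if taken or crosses_boundary:
            pws.append(current)
            current = []
    if current:
        pws.append(current)
    return pws

# Example: three instructions; the second is a predicted-taken branch,
# so it terminates the first PW.
stream = [(0x100, 3, False), (0x103, 5, True), (0x108, 4, False)]
print(len(split_into_pws(stream)))  # -> 2
```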

Uop cache entry

  • A PW termination condition causes the uop cache entry to be written to a uop cache line
  • A uop cache entry is a set of uops
  • Takeaway: a uop cache entry may not occupy an entire uop cache line
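The takeaway can be made concrete with a minimal sketch. The sizes below (64B lines, 8B per uop) are illustrative assumptions; real uop encodings are proprietary:

```python
# Assumption: 64B uop cache lines, 8B per uop (illustrative only).
LINE_BYTES = 64
UOP_BYTES = 8

def wasted_bytes(num_uops_in_entry):
    """Bytes left unused in a line holding one uop cache entry."""
    used = num_uops_in_entry * UOP_BYTES
    assert used <= LINE_BYTES, "entry would not fit in one line"
    return LINE_BYTES - used

# A PW terminated after only 3 uops leaves most of its line empty.
print(wasted_bytes(3))  # -> 40
```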

Impact of uop cache

[Figure: per-workload UPC improvement and decoder power reduction; chart annotations: 25%, 40%]

  • Average UPC improvement: 11.2% (64K uop cache)
  • Average decoder power reduction: 39.2%

What next?

  • Throw money at it: increase the uop cache size

  • Optimize the current uop cache design

Motivation: Fragmentation

  • 72% of uop cache lines are highly fragmented
  • Main cause: the termination conditions

*cache line size: 64B
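One way to picture the 72% statistic is a toy fragmentation metric. The threshold ("entry uses at most half the line") and the sample sizes are made up for illustration; the paper's exact definition may differ:

```python
# Toy fragmentation metric over 64B uop cache lines.
LINE_BYTES = 64

def highly_fragmented_fraction(entry_sizes, line_bytes=LINE_BYTES):
    """Fraction of lines whose single entry fills at most half the
    line (one possible reading of 'highly fragmented')."""
    frag = sum(1 for s in entry_sizes if s <= line_bytes // 2)
    return frag / len(entry_sizes)

# Hypothetical per-line entry sizes in bytes:
sizes = [16, 24, 64, 32, 8, 56, 24, 16, 40, 32]
print(highly_fragmented_fraction(sizes))  # -> 0.7
```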

Fragmentation source: I-cache line boundary

  • Termination condition leads to smaller uop cache entries
  • Low uop cache utilization

Fragmentation source: Predicted taken branch

  • Termination condition leads to smaller uop cache entries
  • Low uop cache utilization

Fragmentation source: other conditions

Other terminating conditions:

  • Max. no. of immediate/displacement values per cache line
  • Max. no. of microcoded instructions per cache line

Solutions proposed

Two solutions to reduce fragmentation:

  • CLASP
  • Compaction
              • RAC
              • PWAC
              • Forced-PWAC

Cache line boundary agnostic uop cache (CLASP)

[Figure: sequential code (no branch) spanning an I-cache line boundary]

Cache line boundary agnostic uop cache (CLASP)

[Figure: the two uop cache entries merged into one]

  • Relaxes I-cache line boundary termination condition
  • Merges uop cache entries from sequential code
  • Doubles dispatch bandwidth
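The merge condition can be sketched as follows. The entry representation (start/end byte addresses plus a flag for how the entry was terminated) is an illustrative assumption, not the paper's hardware format:

```python
# CLASP sketch: merge two uop cache entries when the first was
# terminated only by an I-cache line boundary and the second starts
# exactly where the first ends (i.e., sequential code).
def clasp_merge(entry_a, entry_b):
    """Each entry: dict with 'start', 'end' (byte addresses) and
    'ended_by_boundary' (True if only the line boundary ended it).
    Returns the merged entry, or None if not mergeable."""
    if entry_a["ended_by_boundary"] and entry_b["start"] == entry_a["end"]:
        return {"start": entry_a["start"],
                "end": entry_b["end"],
                "ended_by_boundary": entry_b["ended_by_boundary"]}
    return None

a = {"start": 0x100, "end": 0x140, "ended_by_boundary": True}
b = {"start": 0x140, "end": 0x168, "ended_by_boundary": False}
merged = clasp_merge(a, b)
print(hex(merged["start"]), hex(merged["end"]))  # -> 0x100 0x168
```

An entry ended by a predicted-taken branch (like `b` above) stays unmergeable, which is why compaction is still needed.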

Cache line boundary agnostic uop cache (CLASP)

On average, 35% of uop cache entries are merged by CLASP

Pause

Any questions?

Fragmentation source: Predicted taken branch

On average, 49% of uop cache entries are terminated by predicted taken branches

Compaction

  • Targets termination conditions other than the I-cache line boundary
  • Compacts multiple uop cache entries into a single cache line
  • Unlike CLASP, entries are not merged into a single entry
  • Unlike CLASP, dispatch bandwidth remains the same
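The core check is simply whether two independent entries fit in one line. A minimal sketch, with illustrative byte sizes and a 64B line assumed:

```python
# Compaction sketch: two entries share one line but remain separate
# entries, so each still dispatches on its own (bandwidth unchanged).
LINE_BYTES = 64

def can_compact(existing_entry_bytes, new_entry_bytes,
                line_bytes=LINE_BYTES):
    """True if the new entry fits in the space the existing entry
    leaves free in its line."""
    return existing_entry_bytes + new_entry_bytes <= line_bytes

print(can_compact(24, 32))  # -> True  (56B of 64B used)
print(can_compact(40, 32))  # -> False (72B would overflow the line)
```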

Compaction: Challenge

  • Which Uop cache entries to compact together?

Replacement aware compaction (RAC)

  • Compacts the new uop cache entry with the MRU line in its cache set
  • Ensures compacted uop cache entries are temporally correlated

Note: compaction happens at the time of uop cache fill

PW aware compaction (PWAC)

Attempts to compact the new uop cache entry with entries from the same prediction window

*Overrides RAC

Forced-PWAC

[Figure: uop cache fills from the same PW at times T0 and T1]

Forces compaction of entries from the same PW

Compaction technique distribution

  • Note: all three compaction techniques operate simultaneously
  • Usage priority: F-PWAC > PWAC > RAC
  • The three techniques are used in roughly equal proportion
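The priority order can be sketched as a single fill-time decision. This is a deliberate simplification under assumed data structures (per-line PW tag, MRU flag, byte counts); in particular, F-PWAC's forced re-read/re-write of an already-filled line is folded into the same-PW preference here:

```python
def choose_compaction(new_entry, cache_set, line_bytes=64):
    """Pick a line to compact new_entry into, or None to allocate a
    fresh line. Each line: dict with 'pw_id', 'used_bytes', 'is_mru'.
    new_entry: dict with 'pw_id', 'bytes'."""
    fits = [line for line in cache_set
            if line["used_bytes"] + new_entry["bytes"] <= line_bytes]
    # (F-)PWAC: prefer a line already holding entries from the same PW.
    same_pw = [line for line in fits
               if line["pw_id"] == new_entry["pw_id"]]
    if same_pw:
        return same_pw[0]
    # RAC fallback: compact with the MRU line (temporal correlation).
    mru = [line for line in fits if line["is_mru"]]
    return mru[0] if mru else None

cache_set = [
    {"pw_id": 7, "used_bytes": 24, "is_mru": False},
    {"pw_id": 3, "used_bytes": 16, "is_mru": True},
]
pick = choose_compaction({"pw_id": 7, "bytes": 32}, cache_set)
print(pick["pw_id"])  # -> 7: the same-PW line wins over the MRU line
```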

Evaluation setup

Baseline uop cache fits 2K uops

Performance improvement

Performance improvement with CLASP + compaction: 5.3%

Power reduction

Decoder power reduction with CLASP + compaction: 19.4%

Conclusion

  • The uop cache is highly fragmented due to its termination conditions
  • CLASP reduces fragmentation by relaxing the I-cache line boundary termination condition
  • Compaction reduces fragmentation by placing possibly unrelated uop cache entries in the same line

Limitations

  • F-PWAC incurs an additional read and write
  • Compaction increases uop cache lookup latency