CS-773 Paper Presentation
Improving the Utilization of Micro-operation Caches in x86 Processors

Sumon Nath
Hyperthreads (#6)
sumon@cse.iitb.ac.in
Some of the figures are adapted and modified from the paper

Coming up

  • Background - the micro-op cache
  • Impact of the micro-op cache
  • Motivation - fragmentation
  • Proposed solutions
  • Conclusion

Background: CPU frontend

  • A range of addresses ~ a basic block
  • Variable-length ISA -> power-hungry decoder
  • Fixed-length micro-ops (uops) -> simpler execution logic
  • Uops are cached in the uop cache
  • Uop cache hit -> bypass fetch & decode

Prediction window (PW)

  • A PW can start and end anywhere in an I-cache line
  • Termination conditions:

(1) I-cache line boundary (2) Predicted taken branch
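The two termination rules can be sketched as a toy model. Everything here is an illustrative assumption (instruction tuples, the 64B line size), not the paper's implementation:

```python
# Toy model of prediction-window (PW) formation: a PW ends at an
# I-cache line boundary or at a predicted-taken branch.
# Assumption: 64B I-cache lines.
LINE_SIZE = 64

def split_into_pws(instrs, line_size=LINE_SIZE):
    """instrs: list of (address, length, predicted_taken) tuples,
    assumed sequential. Returns a list of PWs (lists of tuples)."""
    pws, current = [], []
    for addr, length, taken in instrs:
        current.append((addr, length, taken))
        # Instruction ends exactly at, or straddles, a line boundary.
        crosses_boundary = ((addr + length) % line_size == 0 or
                            addr // line_size != (addr + length - 1) // line_size)
        if taken or crosses_boundary:
            pws.append(current)
            current = []
    if current:
        pws.append(current)
    return pws

# Example: three instructions; the second is a predicted-taken branch,
# so it terminates the first PW.
stream = [(0x100, 3, False), (0x103, 5, True), (0x108, 4, False)]
print(len(split_into_pws(stream)))  # -> 2
```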

Uop cache entry

  • A PW termination condition causes the uop cache entry to be written to a uop cache line
  • A uop cache entry is a set of uops
  • Takeaway: a uop cache entry may not occupy an entire uop cache line
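The takeaway can be made concrete with a minimal sketch. The sizes below (64B lines, 8B per uop) are illustrative assumptions; real uop encodings are proprietary:

```python
# Assumption: 64B uop cache lines, 8B per uop (illustrative only).
LINE_BYTES = 64
UOP_BYTES = 8

def wasted_bytes(num_uops_in_entry):
    """Bytes left unused in a line holding one uop cache entry."""
    used = num_uops_in_entry * UOP_BYTES
    assert used <= LINE_BYTES, "entry would not fit in one line"
    return LINE_BYTES - used

# A PW terminated after only 3 uops leaves most of its line empty.
print(wasted_bytes(3))  # -> 40
```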

Impact of uop cache

[Figure: per-workload UPC improvement and decoder power reduction; chart annotations: 25%, 40%]

  • Average UPC improvement: 11.2% (64K uop cache)
  • Average decoder power reduction: 39.2%

What next?

  • Throw money at it: increase the uop cache size

  • Optimize the current uop cache design

Motivation: Fragmentation

  • 72% of uop cache lines are highly fragmented
  • Main cause: the termination conditions

*cache line size: 64B
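One way to picture the 72% statistic is a toy fragmentation metric. The threshold ("entry uses at most half the line") and the sample sizes are made up for illustration; the paper's exact definition may differ:

```python
# Toy fragmentation metric over 64B uop cache lines.
LINE_BYTES = 64

def highly_fragmented_fraction(entry_sizes, line_bytes=LINE_BYTES):
    """Fraction of lines whose single entry fills at most half the
    line (one possible reading of 'highly fragmented')."""
    frag = sum(1 for s in entry_sizes if s <= line_bytes // 2)
    return frag / len(entry_sizes)

# Hypothetical per-line entry sizes in bytes:
sizes = [16, 24, 64, 32, 8, 56, 24, 16, 40, 32]
print(highly_fragmented_fraction(sizes))  # -> 0.7
```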

Fragmentation source: I-cache line boundary

  • Termination condition leads to smaller uop cache entries
  • Low uop cache utilization

Fragmentation source: Predicted taken branch

  • Termination condition leads to smaller uop cache entries
  • Low uop cache utilization

Fragmentation source: other conditions

Other terminating conditions:

  • Max. no. of immediate/displacement values per cache line
  • Max. no. of microcoded instructions per cache line

Solutions proposed

Two solutions to reduce fragmentation:

  • CLASP
  • Compaction
              • RAC
              • PWAC
              • Forced-PWAC

Cache line boundary agnostic uop cache (CLASP)

[Figure: sequential code (no branch) spanning an I-cache line boundary]

Cache line boundary agnostic uop cache (CLASP)

[Figure: the two uop cache entries merged into one]

  • Relaxes I-cache line boundary termination condition
  • Merges uop cache entries from sequential code
  • Doubles dispatch bandwidth
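The merge condition can be sketched as follows. The entry representation (start/end byte addresses plus a flag for how the entry was terminated) is an illustrative assumption, not the paper's hardware format:

```python
# CLASP sketch: merge two uop cache entries when the first was
# terminated only by an I-cache line boundary and the second starts
# exactly where the first ends (i.e., sequential code).
def clasp_merge(entry_a, entry_b):
    """Each entry: dict with 'start', 'end' (byte addresses) and
    'ended_by_boundary' (True if only the line boundary ended it).
    Returns the merged entry, or None if not mergeable."""
    if entry_a["ended_by_boundary"] and entry_b["start"] == entry_a["end"]:
        return {"start": entry_a["start"],
                "end": entry_b["end"],
                "ended_by_boundary": entry_b["ended_by_boundary"]}
    return None

a = {"start": 0x100, "end": 0x140, "ended_by_boundary": True}
b = {"start": 0x140, "end": 0x168, "ended_by_boundary": False}
merged = clasp_merge(a, b)
print(hex(merged["start"]), hex(merged["end"]))  # -> 0x100 0x168
```

An entry ended by a predicted-taken branch (like `b` above) stays unmergeable, which is why compaction is still needed.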

Cache line boundary agnostic uop cache (CLASP)

On average, 35% of uop cache entries are merged by CLASP

Pause

Any questions?

Fragmentation source: Predicted taken branch

On average, 49% of uop cache entries are terminated by predicted taken branches

Compaction

  • Targets termination conditions other than the I-cache line boundary
  • Compacts multiple uop cache entries into a single cache line
  • Unlike CLASP, entries are not merged into a single entry
  • Unlike CLASP, dispatch bandwidth remains the same
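The core check is simply whether two independent entries fit in one line. A minimal sketch, with illustrative byte sizes and a 64B line assumed:

```python
# Compaction sketch: two entries share one line but remain separate
# entries, so each still dispatches on its own (bandwidth unchanged).
LINE_BYTES = 64

def can_compact(existing_entry_bytes, new_entry_bytes,
                line_bytes=LINE_BYTES):
    """True if the new entry fits in the space the existing entry
    leaves free in its line."""
    return existing_entry_bytes + new_entry_bytes <= line_bytes

print(can_compact(24, 32))  # -> True  (56B of 64B used)
print(can_compact(40, 32))  # -> False (72B would overflow the line)
```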

Compaction: Challenge

  • Which Uop cache entries to compact together?

Replacement aware compaction (RAC)

  • Compacts the new uop cache entry with the MRU line in its cache set
  • Ensures compacted uop cache entries are temporally correlated

Note: compaction happens at the time of uop cache fill

PW aware compaction (PWAC)

Attempts to compact the new uop cache entry with entries from the same prediction window

*Overrides RAC

Forced-PWAC

[Figure: uop cache fills from the same PW at times T0 and T1]

Forces compaction of entries from the same PW

Compaction technique distribution

  • Note: all three compaction techniques operate simultaneously
  • Usage priority: F-PWAC > PWAC > RAC
  • The three techniques are used in roughly equal proportion
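The priority order can be sketched as a single fill-time decision. This is a deliberate simplification under assumed data structures (per-line PW tag, MRU flag, byte counts); in particular, F-PWAC's forced re-read/re-write of an already-filled line is folded into the same-PW preference here:

```python
def choose_compaction(new_entry, cache_set, line_bytes=64):
    """Pick a line to compact new_entry into, or None to allocate a
    fresh line. Each line: dict with 'pw_id', 'used_bytes', 'is_mru'.
    new_entry: dict with 'pw_id', 'bytes'."""
    fits = [line for line in cache_set
            if line["used_bytes"] + new_entry["bytes"] <= line_bytes]
    # (F-)PWAC: prefer a line already holding entries from the same PW.
    same_pw = [line for line in fits
               if line["pw_id"] == new_entry["pw_id"]]
    if same_pw:
        return same_pw[0]
    # RAC fallback: compact with the MRU line (temporal correlation).
    mru = [line for line in fits if line["is_mru"]]
    return mru[0] if mru else None

cache_set = [
    {"pw_id": 7, "used_bytes": 24, "is_mru": False},
    {"pw_id": 3, "used_bytes": 16, "is_mru": True},
]
pick = choose_compaction({"pw_id": 7, "bytes": 32}, cache_set)
print(pick["pw_id"])  # -> 7: the same-PW line wins over the MRU line
```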

Evaluation setup

Baseline uop cache fits 2K uops

Performance improvement

Performance improvement with CLASP + compaction: 5.3%

Power reduction

Decoder power reduction with CLASP + compaction: 19.4%

Conclusion

  • The uop cache is highly fragmented due to its termination conditions
  • CLASP reduces fragmentation by relaxing the I-cache line boundary termination condition
  • Compaction reduces fragmentation by placing possibly unrelated uop cache entries in the same line

Limitations

  • F-PWAC incurs an additional read and write
  • Compaction increases uop cache lookup latency