CS-773 Paper Presentation
Improving the Utilization of Micro-operation Caches in x86 Processors
Sumon Nath
Hyperthreads (#6)
sumon@cse.iitb.ac.in
Some of the figures are adapted and modified from the paper
Coming up
Background: CPU frontend
Range of address ~ basic block
Variable length ISA -> Power hungry decoder
Fixed-length micro-op (uop) -> simpler execution logic
Uops cached into uop cache
Hit uop cache -> bypass fetch & decode
Prediction window (PW)
A PW ends at: (1) an I-cache line boundary, (2) a predicted taken branch
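The two PW terminating conditions can be illustrated with a small sketch. It is not the paper's hardware logic: the `Inst` record, the `split_into_pws` helper, and the rule that an instruction belongs to the line containing its start address are all assumptions made for illustration. The 64 B line size is the one noted later in the deck.

```python
from dataclasses import dataclass
from typing import List

LINE_SIZE = 64  # bytes; I-cache line size (assumed from the deck's 64B footnote)


@dataclass
class Inst:
    addr: int                   # start address in bytes
    length: int                 # encoded length in bytes (x86 is variable length)
    taken_branch: bool = False  # True if predicted taken (hypothetical flag)


def split_into_pws(insts: List[Inst]) -> List[List[Inst]]:
    """Cut an in-order instruction stream into prediction windows."""
    pws: List[List[Inst]] = []
    current: List[Inst] = []
    for inst in insts:
        # Condition (1): the instruction starts in a different I-cache line.
        if current and inst.addr // LINE_SIZE != current[0].addr // LINE_SIZE:
            pws.append(current)
            current = []
        current.append(inst)
        # Condition (2): a predicted taken branch closes the window.
        if inst.taken_branch:
            pws.append(current)
            current = []
    if current:
        pws.append(current)
    return pws


stream = [
    Inst(0x00, 4),
    Inst(0x3C, 4),
    Inst(0x40, 4, taken_branch=True),
    Inst(0x100, 2),
]
pws = split_into_pws(stream)
```

Here the first two instructions share a line and form one PW, the taken branch at 0x40 closes a second PW, and the branch target starts a third.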
Uop cache entry
[Figure labels: uop cache line, cache entry]
Impact of Uop cache
[Figure annotations: 25%, 40%]
What next?
Motivation: Fragmentation
*cache line size: 64B
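As a rough illustration of what fragmentation costs, the sketch below computes the fraction of allocated bytes left unused when every uop cache entry occupies a line of its own. It assumes the 64 B line size in the footnote applies to uop cache lines; the entry sizes and the `wasted_fraction` helper are hypothetical and are not taken from the paper.

```python
from typing import Sequence

LINE_SIZE = 64  # bytes per line (from the slide footnote)


def wasted_fraction(entry_sizes: Sequence[int]) -> float:
    """Unused share of allocated bytes when each entry gets its own line."""
    if not entry_sizes:
        return 0.0
    if any(size > LINE_SIZE for size in entry_sizes):
        raise ValueError("an entry cannot exceed one line in this sketch")
    allocated = LINE_SIZE * len(entry_sizes)
    return (allocated - sum(entry_sizes)) / allocated


# Hypothetical entry sizes in bytes: three short entries, three lines allocated.
example = wasted_fraction([16, 32, 64])
```

Short entries, such as those cut off early by a line boundary or a taken branch, push this fraction up, which is the effect the following slides attribute to each terminating condition.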
Fragmentation source: I-cache line boundary
Fragmentation source: Predicted taken branch
Fragmentation source: other conditions
Other terminating conditions:
Solutions proposed
Two solutions to reduce fragmentation: CLASP and compaction
Cache line boundary agnostic Uop (CLASP)
[Figure labels: I-cache line boundary; sequential code (no branch)]
[Figure label: Merged]
On average 35% of Uop cache entries are merged by CLASP
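The merging idea can be sketched as follows, under stated assumptions. Two consecutive entries are merged when the first does not end in a predicted taken branch, the second starts exactly where the first ends (sequential code), and the combined uops still fit in one uop cache line. The `Entry` record, the byte-based size model, and the 64 B capacity are illustrative assumptions. The actual design may also bound how many I-cache lines one entry can span, which this sketch does not model.

```python
from dataclasses import dataclass
from typing import List

UOP_LINE_BYTES = 64  # assumed uop cache line capacity in bytes


@dataclass
class Entry:
    start: int                            # first instruction byte covered
    end: int                              # one past the last byte covered
    size: int                             # bytes of uops held (hypothetical)
    ends_with_taken_branch: bool = False


def clasp_merge(entries: List[Entry]) -> List[Entry]:
    """Merge sequential entries that were split only by an I-cache line boundary."""
    merged: List[Entry] = []
    for e in entries:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and not prev.ends_with_taken_branch        # no branch in between
            and prev.end == e.start                    # sequential code
            and prev.size + e.size <= UOP_LINE_BYTES   # still fits in one line
        ):
            merged[-1] = Entry(prev.start, e.end, prev.size + e.size,
                               e.ends_with_taken_branch)
        else:
            merged.append(e)
    return merged


entries = [
    Entry(0x00, 0x40, 24),                               # cut at a line boundary
    Entry(0x40, 0x60, 20, ends_with_taken_branch=True),  # sequential continuation
    Entry(0x200, 0x220, 30),                             # branch target, kept apart
]
result = clasp_merge(entries)
```

The first two entries collapse into one 44-byte entry spanning the line boundary, while the entry at the branch target stays separate.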
Pause
Any questions?
Fragmentation source: Predicted taken branch
On average 49% of Uop cache entries terminated by predicted taken branches
Compaction
Compaction: Challenge
Replacement-aware compaction (RAC)
[Figure label: MRU]
Note: compaction at the time of Uop cache fill
PW-aware compaction (PWAC)
[Figure label: MRU]
* Overrides RAC
Attempts to compact new Uop cache entry with entries from same Prediction window
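The fill-time choice between the two policies might look like the sketch below. It assumes that PWAC is tried first because it overrides RAC, that RAC targets the most recently used line of the set, and that compaction requires the new entry to fit in the line's free space. The `UopEntry` and `Line` records, the `choose_line` helper, and the byte-based capacity are hypothetical, and real hardware would also handle the fallback replacement.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

LINE_BYTES = 64  # assumed uop cache line capacity in bytes


@dataclass
class UopEntry:
    pw_id: int  # prediction window the entry came from (hypothetical id)
    size: int   # bytes of uops


@dataclass
class Line:
    entries: List[UopEntry] = field(default_factory=list)

    def free(self) -> int:
        return LINE_BYTES - sum(e.size for e in self.entries)


def choose_line(lines_mru_first: List[Line],
                new: UopEntry) -> Tuple[str, Optional[int]]:
    """Pick a line of one set to compact `new` into at fill time."""
    # PWAC: prefer a line that already holds an entry from the same PW.
    for i, line in enumerate(lines_mru_first):
        if line.free() >= new.size and any(e.pw_id == new.pw_id
                                           for e in line.entries):
            return ("PWAC", i)
    # RAC: otherwise compact into the MRU line if the entry fits.
    if lines_mru_first:
        mru = lines_mru_first[0]
        if mru.entries and mru.free() >= new.size:
            return ("RAC", 0)
    # No compaction possible: the caller allocates a line as usual.
    return ("no-compaction", None)


cache_set = [Line([UopEntry(1, 20)]), Line([UopEntry(2, 30)])]  # index 0 is MRU
same_pw = choose_line(cache_set, UopEntry(2, 20))
other_pw = choose_line(cache_set, UopEntry(3, 20))
too_big = choose_line(cache_set, UopEntry(3, 50))
```

An entry from PW 2 joins the line that already holds PW 2, an unrelated entry falls back to the MRU line, and an entry that fits nowhere is filled without compaction.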
Forced-PWAC
[Figure labels: T0, T1]
Forces compaction of entries from same PW
Compaction technique distribution
Evaluation setup
Baseline Uop cache fits 2K uops
Performance improvement
Performance improvement with CLASP + compaction: 5.3%
Power reduction
Decoder power reduction with CLASP + compaction: 19.4%
Conclusion
Limitations