1 of 31

CS-773 Paper Presentation

Filter Caching for Free: The Untapped Potential of the Store Buffer

Prajeeth S

Team (#4) Spectredown

190050117@iitb.ac.in

2 of 31

Plan:

  • Introduction to SQ/SB
  • Issues with current store queues
  • Store Buffer Cache
  • Coherence and Synonyms
  • Results

3 of 31

Store Queue(SQ):

  • The SQ keeps stores in program order, as the TSO memory model requires
  • Two purposes
    • In-order commit of stores
    • Out-of-order (O3) loads can forward data from older stores that have not yet committed
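The two purposes listed above can be illustrated with a toy model. This is a minimal sketch, not a description of any real pipeline: the `StoreEntry` fields, the `seq` program-order number, and the method names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    seq: int    # hypothetical program-order sequence number
    addr: int
    value: int


class StoreQueue:
    """Toy store queue: in-order commit plus store-to-load forwarding."""

    def __init__(self):
        self.entries = []  # oldest store first

    def insert(self, seq, addr, value):
        self.entries.append(StoreEntry(seq, addr, value))

    def forward(self, load_seq, addr):
        # Search from youngest to oldest for a store that is older than
        # the load and writes the same address.
        for entry in reversed(self.entries):
            if entry.seq < load_seq and entry.addr == addr:
                return entry.value
        return None  # no match: the load must read the cache

    def commit_oldest(self):
        # Stores leave the queue strictly in program order.
        return self.entries.pop(0) if self.entries else None
```

A load with sequence number 2 sees only the store with sequence number 1, even if a younger store to the same address is already queued.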

4 of 31

Store Buffer(SB):

  • Stores can stall at the SQ during commit
  • To reduce latency, stores are moved from the SQ to the SB on commit
  • The SB is merged with the SQ as a single circular FIFO queue

5 of 31

Scope for Improvements in SQ/SB

  • Low Utilization due to:
    • Small size
    • Aggressive eviction policy

6 of 31

Scope for Improvements in SQ/SB

Optimal SQ/SB has a much higher hit ratio

7 of 31

Scope for Improvements in SQ/SB

  • Current designs have a low hit ratio in the SQ/SB
  • This reduces the benefit of its low latency and low energy
  • A large fraction of loads could be served from the SQ/SB rather than L1

8 of 31

Filter Cache: Overview

  • A very small cache between CPU and L1.
  • Lower latency and power
  • But, small size => low hit rate
  • Frequent evictions, plus a write into the filter cache on load misses => bad performance

9 of 31

SQ/SB as Filter Cache?

  • All stores go through SQ/SB
  • Must be probed on loads
  • SQ/SB pays filter-cache costs without benefits

10 of 31

Reducing L1/TLB accesses using SQ/SB

  • If data in SQ/SB => no L1/TLB access => low latency, energy
  • If miss then sequential access of SQ/SB & L1 => high latency
    • Fix: Accurate prediction of presence in SQ/SB
  • Need to keep as much data as possible in the SQ/SB => use it as a filter cache
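The latency trade-off above can be put into numbers with a small model. This is only a sketch: the cycle counts `t_sqsb=1` and `t_l1=3` are assumed values chosen for illustration, and the predictor is assumed to be perfect.

```python
def sequential_latency(hit_ratio, t_sqsb=1, t_l1=3):
    """Expected load latency when the SQ/SB is probed first and L1 only on a miss."""
    return hit_ratio * t_sqsb + (1 - hit_ratio) * (t_sqsb + t_l1)


def predicted_latency(hit_ratio, t_sqsb=1, t_l1=3):
    """Expected load latency when a perfect predictor sends each load
    directly to the structure that holds its data."""
    return hit_ratio * t_sqsb + (1 - hit_ratio) * t_l1


def l1_accesses_per_load(hit_ratio):
    """Fraction of loads that still access L1 (and the TLB)."""
    return 1 - hit_ratio
```

Under these assumptions, a 50% hit ratio gives 2.5 cycles per load with sequential access and 2.0 cycles with perfect prediction. A higher hit ratio helps both latency and the number of L1/TLB accesses.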

11 of 31

Store Buffer Cache(SBC)

  • A portion of the unified SQ/SB storage is used as a cache; the combined structure is the S/QBC
  • On write to L1, an entry moves from the SB to the SBC
  • Moving entries from SQ->SB and SB->SBC needs only 1 extra head pointer
  • Higher hit rate, because entries are evicted from the cache only when extra space is needed
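The pointer-based design above can be sketched as one circular buffer split into three regions. This is a minimal illustration under assumptions: the class name `SQBC`, the pointer names, and the use of monotonic counters are hypothetical, and the lookup ignores load age for brevity.

```python
class SQBC:
    """Toy unified SQ/SB/SBC: one circular buffer, three regions.

    From oldest to youngest: SBC [sbc_head, sb_head),
    SB [sb_head, sq_head), SQ [sq_head, tail).
    """

    def __init__(self, capacity):
        self.cap = capacity
        self.slots = [None] * capacity
        # Monotonic counters; the slot index is counter % capacity.
        self.sbc_head = 0  # oldest cached entry (the one extra pointer)
        self.sb_head = 0   # oldest committed store not yet written to L1
        self.sq_head = 0   # oldest uncommitted store
        self.tail = 0      # next free slot

    def insert_store(self, addr, value):
        if self.tail - self.sbc_head == self.cap:  # buffer is full
            if self.sbc_head == self.sb_head:      # nothing cached to evict
                return False                       # the store must stall
            self.sbc_head += 1                     # evict the oldest SBC entry
        self.slots[self.tail % self.cap] = (addr, value)
        self.tail += 1
        return True

    def commit(self):
        # SQ -> SB: only a pointer moves, no data is copied.
        if self.sq_head < self.tail:
            self.sq_head += 1

    def write_to_l1(self):
        # SB -> SBC: again only a pointer moves.
        if self.sb_head < self.sq_head:
            self.sb_head += 1

    def lookup(self, addr):
        # Youngest matching entry wins, across all three regions.
        for i in range(self.tail - 1, self.sbc_head - 1, -1):
            entry_addr, value = self.slots[i % self.cap]
            if entry_addr == addr:
                return value
        return None

    def sizes(self):
        """Return (SQ, SB, SBC) occupancy."""
        return (self.tail - self.sq_head,
                self.sq_head - self.sb_head,
                self.sb_head - self.sbc_head)
```

Cached entries stay readable until a new store actually needs the slot, which is where the higher hit rate comes from in this model.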

12 of 31

SBC Synonyms

  • VA->PA translation is needed to handle synonyms (different VAs that map to the same PA)
  • SQ/SB entries hold both the PA and the VA
  • If a load whose PA is not yet known hits on an SBC entry, no extra TLB access is needed

13 of 31

SBC Coherence: Naive Approaches

  • MESI Protocol
  • 2 Naive solutions:
    • Forward cache invalidation to S/QBC
    • Flush SBC on any invalidation or downgrade

14 of 31

SBC Coherence: Optimization 1

  • Bulk flush only on downgrade from Modified state:
    • Either M->I
    • Or M->S since we can’t track future changes
    • Or eviction of cache line in M state
  • Unlike naive solution #2, invalidations and evictions of lines in other states no longer cause flushes
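The difference between naive solution #2 and Optimization 1 comes down to one predicate on the L1 state transition. The sketch below is an illustration only: the `State` ordering and both function names are assumptions, and an eviction is modeled as a transition to `I`.

```python
from enum import IntEnum


class State(IntEnum):
    # MESI states, ordered so that a lower value means fewer permissions.
    I = 0
    S = 1
    E = 2
    M = 3


def naive_flush(old, new):
    """Naive solution #2: flush the SBC on any invalidation or downgrade."""
    return new < old


def opt1_flush(old, new):
    """Optimization 1: flush the SBC only when a line leaves the Modified state."""
    return old == State.M and new != State.M
```

For example, an `S -> I` invalidation triggers a flush in the naive scheme but not under Optimization 1, while `M -> S` and `M -> I` trigger a flush in both.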

15 of 31

SBC Coherence: Leading to Optimization 2

  • Life of Cache lines >> Life of SBC entries
  • The data of a downgraded M-state line might no longer be in the SBC
  • Store “epoch” of data -> flush SBC entries with same epoch
  • Use multi-colored dirty bits
  • For 2 colors - Flush entries from SBC having the same color as that of the L1 line
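The two-color scheme above can be sketched as follows. This is a simplified model under assumptions: the class and method names are hypothetical, SBC entries are tagged with their color for clarity, and `downgrade` stands for any invalidation, downgrade, or eviction of an L1 line.

```python
class ColoredSBC:
    """Toy model of epoch-colored dirty bits with two colors."""

    COLORS = ("black", "red")

    def __init__(self):
        self.epoch = 0            # index of the current color
        self.sbc = []             # (addr, value, color) tuples
        self.l1_dirty_color = {}  # L1 line address -> color of its dirty bit

    def current_color(self):
        return self.COLORS[self.epoch]

    def write_store(self, addr, value):
        # Data leaving the SB is written to L1 with the current color's
        # dirty bit set, then kept in the SBC.
        color = self.current_color()
        self.l1_dirty_color[addr] = color
        self.sbc.append((addr, value, color))

    def downgrade(self, addr):
        """Handle an L1 line losing Modified state; return entries flushed."""
        color = self.l1_dirty_color.pop(addr, None)
        if color is None:
            return 0  # line was not dirty: no effect on the SBC
        before = len(self.sbc)
        # Flush every SBC entry that has the same color as the L1 line.
        self.sbc = [e for e in self.sbc if e[2] != color]
        if color == self.current_color():
            # Start a new epoch with the other color.
            self.epoch = (self.epoch + 1) % len(self.COLORS)
        return before - len(self.sbc)
```

With only two colors the scheme is conservative: once the epoch wraps around, a stale line from an earlier black epoch can still flush entries of the current black epoch. That is safe but causes extra flushes, which is one reason to consider more colors.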

16 of 31

SBC Coherence: Optimization 2

SQ, SB, SBC (assuming 2 colors)

Black epoch, write: dirty data from the SB is written to L1 with the black dirty bit set, then moved to the SBC.

17 of 31

SBC Coherence: Optimization 2

SQ, SB, SBC (assuming 2 colors)

Black epoch, invalidation/downgrade of non-black data: no effect on the SBC.

18 of 31

SBC Coherence: Optimization 2

SQ, SB, SBC (assuming 2 colors)

Black epoch, invalidation/downgrade of black data: flush all black entries in the SBC and change the epoch to red.

19 of 31

SBC Coherence: Optimization 2

SQ, SB, SBC (assuming 2 colors)

Red epoch, write: dirty data from the SB is written to L1 with the red dirty bit set, then moved to the SBC.

20 of 31

SBC Coherence: Optimization 2

SQ, SB, SBC (assuming 2 colors)

Red epoch, invalidation/downgrade of black data: no effect on the SBC.

21 of 31

SBC Coherence: Optimization 2

SQ, SB, SBC (assuming 2 colors)

Red epoch, invalidation/downgrade of red data: flush all red entries from the SBC and change the epoch to black.

22 of 31

SBC Coherence: Optimization 2 Extensions

  • 2 colors can be generalized to any number of colors
  • N bits give 2^N - 1 colors
  • Alternative:
    • Red -> maybe in SBC
    • Black -> not in SBC
    • On flush switch all red->black in cache
    • Red shows only last epoch at any time
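The color count from the list above is easy to tabulate. The sketch assumes that one of the 2^N encodings is reserved for "not dirty", which would explain the "- 1"; the function name is hypothetical.

```python
def num_colors(bits):
    """Colors available with `bits` dirty bits per line, assuming one
    encoding is reserved to mean 'not dirty'."""
    return 2 ** bits - 1


# Color counts for 2, 3 and 4 bits per line.
color_counts = {bits: num_colors(bits) for bits in (2, 3, 4)}
# color_counts == {2: 3, 3: 7, 4: 15}
```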

23 of 31

SBC Coherence: flash reset vs. 3 colors

24 of 31

SBC Coherence: flash reset vs. 3 colors

7 colors, 15 colors, and flash reset are almost optimal; 3 colors are good enough.

25 of 31

Predicting hits: Memory Dependence Predictor

Modern systems can predict which level of the cache hierarchy holds the data.

Can be used to reduce latency due to filter cache
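To make the idea of hit prediction concrete, here is a generic sketch. It is not the memory dependence predictor the slides refer to: it is a plain table of 2-bit saturating counters indexed by the load's PC, and the class name, table size, and indexing are all assumptions.

```python
class HitPredictor:
    """Hypothetical PC-indexed predictor of SQ/SB/SBC hits."""

    def __init__(self, entries=256):
        self.entries = entries
        # 2-bit saturating counters, initialized to "weakly miss" (1).
        self.counters = [1] * entries

    def _index(self, pc):
        return pc % self.entries

    def predict_hit(self, pc):
        # Counter values 2 and 3 predict a hit, so L1 can be skipped.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, was_hit):
        # Train with the actual outcome once the load resolves.
        i = self._index(pc)
        if was_hit:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

A load predicted to hit probes only the store structures. A wrong "hit" prediction costs an extra sequential L1 access, which is why predictor accuracy matters for performance.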

26 of 31

Results: With Predictor

Memory dependence predictor reduces hit ratio

27 of 31

Results: Energy Savings

Dynamic Energy Savings

28 of 31

Results - Energy Savings

Parallel Workloads

29 of 31

Results - IPC

IPC Improvements

30 of 31

Best and Worst cases

Read Locality (Y/N)   Predictor Accuracy (Y/N)   Energy        Performance
Y                     Y                          Improvement   Improvement
Y                     N                          Same          Same
N                     Y                          Same          Same
N                     N                          Same          May Reduce

31 of 31

Points to Discuss:

  • Coherence for weaker memory models?
  • The paper claims that memory dependence predictors already exist. Is that true?
