1 of 46

CS-773 Paper Presentation

Predicting Performance Impact of DVFS for Realistic Memory Systems

Abhinav Sridhar

The Booster Dose (#1)

abhinavsridhar22@iitb.ac.in

2 of 46

DVFS

  • Dynamic Voltage Frequency Scaling
  • Why run the processor at full throttle when memory is slowing you down?

3 of 46

DVFS

  • Two ways to fill the stall cycles:
    • Increase independent instructions
    • Slow down the processor

Image Reference: Slides by Prof. Biswabandan Panda

4 of 46

Two Phase View

  • Two major assumptions:
    • All memory accesses have the same latency
    • On a miss, once the independent instructions have executed, the processor stalls until the access is serviced

5 of 46

Current Methods: Leading Loads

Tmemory → memory access latency

Ccompute → # compute cycles

t → cycle time

T = Ccompute * t + Tmemory
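
A minimal sketch of how this linear model can be applied (my illustration, not the paper's implementation; counter values below are hypothetical), assuming hardware counters expose Ccompute and Tmemory for an interval:

    # Leading-loads style prediction: compute time scales with the clock,
    # memory time (measured from leading loads) is assumed frequency-independent.
    def predict_time(c_compute, t_memory_ns, cycle_time_ns):
        return c_compute * cycle_time_ns + t_memory_ns

    # Hypothetical counter readings for one interval.
    c_compute = 4_000_000      # compute cycles
    t_memory_ns = 1_500_000    # leading-load memory time in nanoseconds

    # Predicted execution time at a few candidate core frequencies.
    for freq_ghz in (1.0, 2.0, 3.0):
        cycle_ns = 1.0 / freq_ghz
        print(f"{freq_ghz} GHz -> {predict_time(c_compute, t_memory_ns, cycle_ns):,.0f} ns")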

6 of 46

Current Methods: Stall Time

  • Counter-based architecture
  • Estimates execution time from the stall time experienced by the processor
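
A rough sketch of the stall-time style of prediction (my illustration, not the paper's hardware), assuming a counter that accumulates cycles in which the core is stalled on memory:

    # Stall-time model: stall time is assumed fixed in absolute time,
    # while the remaining (non-stall) cycles rescale with the new clock.
    def predict_time_stall(total_cycles, stall_cycles, cur_cycle_ns, new_cycle_ns):
        stall_time_ns = stall_cycles * cur_cycle_ns
        nonstall_cycles = total_cycles - stall_cycles
        return stall_time_ns + nonstall_cycles * new_cycle_ns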

7 of 46

Shortcomings: DRAM

  • DRAM access latency can be variable due to:
    • Row hits
    • Bank conflicts and bank level parallelism
    • Variable time spent in memory queue

8 of 46

Shortcomings: Prefetching

  • Prefetching issues multiple main memory accesses for data to be used in the future
  • Significantly increases memory bandwidth demand

9 of 46

Potential for Improvement

STATE-OF-THE-ART DVFS CONTROLLERS CRUMBLE UNDER REALISTIC DRAM MODELS AS WELL AS PREFETCHERS

10 of 46

Realistic Execution Sequence

  • Not all DRAM accesses have equal latencies
  • Independent instructions continue to fill the ROB during a long latency miss

11 of 46

CRIT: Critical Path Calculation

  • Assumption: if two memory requests are serialized, they can be treated as a dependency chain

12 of 46

CRIT: Critical Path Calculation

  • Pglobal : Global Critical Path Counter
  • Initially, Pglobal = 0
  • Pi : Pglobal at initiation of ith request
  • After completion of the ith load, update Pglobal as Pglobal = max(Pglobal, Pi + ΔT), where ΔT is the latency of that load
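
A compact sketch of this bookkeeping (illustrative Python, not the paper's hardware structures):

    class CritPathCounter:
        """Tracks Pglobal, the critical path of off-chip memory time."""

        def __init__(self):
            self.p_global = 0      # Pglobal, initially 0
            self.p_at_init = {}    # Pi: Pglobal snapshot when request i was initiated

        def on_initiate(self, req_id):
            self.p_at_init[req_id] = self.p_global

        def on_complete(self, req_id, latency):
            # Pglobal = max(Pglobal, Pi + ΔT), where ΔT is this request's latency
            self.p_global = max(self.p_global, self.p_at_init.pop(req_id) + latency)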

13 of 46

CRIT: Critical Path Calculation

  • When Load A & Load B are initiated, set PA = PB = 0, since Pglobal = 0

14 of 46

CRIT: Critical Path Calculation

  • When Load A finishes, set Pglobal = max(0, A) = A

15 of 46

CRIT: Critical Path Calculation

  • When Load C is initiated, set PC = A, since Pglobal = A

16 of 46

CRIT: Critical Path Calculation

  • When Load B is completed, Pglobal = max(A, B) = B

17 of 46

CRIT: Critical Path Calculation

  • When Load C is completed, Pglobal = max(B, A+C) = A+C

18 of 46

CRIT: Critical Path Calculation

  • Similarly when Load D and Load E are initiated, PD = PE = A+C

19 of 46

CRIT: Critical Path Calculation

  • When Load D is completed, Pglobal = A+C+D
  • When Load E is completed, Pglobal = max(A+C+D, A+C+E) = A+C+E (E > D in this example)
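
Replaying the Load A–E walkthrough above with the CritPathCounter sketch from slide 12 (latencies are made-up values chosen so that B > A, A+C > B and E > D, matching the example):

    cp = CritPathCounter()
    A, B, C, D, E = 100, 120, 60, 40, 50       # hypothetical latencies (cycles)

    cp.on_initiate("A"); cp.on_initiate("B")   # PA = PB = 0
    cp.on_complete("A", A)                     # Pglobal = A
    cp.on_initiate("C")                        # PC = A
    cp.on_complete("B", B)                     # Pglobal = max(A, B) = B
    cp.on_complete("C", C)                     # Pglobal = max(B, A+C) = A+C
    cp.on_initiate("D"); cp.on_initiate("E")   # PD = PE = A+C
    cp.on_complete("D", D)                     # Pglobal = A+C+D
    cp.on_complete("E", E)                     # Pglobal = A+C+E
    print(cp.p_global)                         # 210 = A + C + E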

20 of 46

CRIT: Critical Path Calculation

  • Stores and write-backs are ignored because they do not stall the processor

21 of 46

Effects of Prefetching

  • When prefetching is introduced, performance saturates as frequency increases
  • As DRAM latency remains constant, decreasing the compute period beyond a point will not result in better performance

22 of 46

Limited Bandwidth Performance Model

  • Frequency of operation is divided into 2 zones:
    • A low-frequency range, where DRAM can service prefetch requests before the chip stalls
    • A high-frequency range, where DRAM cannot service prefetch requests before the chip stalls

23 of 46

Limited Bandwidth Performance Model

T = Tmin memory ; t < tcrossover
T = Tdemand + t * Ccompute ; elsewhere
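
A small sketch of this piecewise model (my rendering; Tmin memory, Tdemand and Ccompute are assumed to come from the predictor's counters, and tcrossover is taken as the cycle time where the two branches meet):

    def predict_time_limited_bw(t_cycle, t_min_memory, t_demand, c_compute):
        # Cycle time at which the bandwidth-bound and compute-bound branches intersect.
        t_crossover = (t_min_memory - t_demand) / c_compute
        if t_cycle < t_crossover:
            return t_min_memory                  # high frequency: DRAM bandwidth bound
        return t_demand + t_cycle * c_compute    # low frequency: leading-loads style term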

24 of 46

DRAM Slack

  • Tmin memory = Tmemory access - Tmemory slack
  • Tmemory slack is defined as extra time spent by DRAM so that timing constraints are not violated

25 of 46

Hardware Overheads

26 of 46

Experimental Methodology

  • No Simulator mentioned

27 of 46

Policies for Comparison

  • Static Optimal: Run the benchmark at all frequency points and choose the one with the lowest power consumption

28 of 46

Policies for Comparison

  • Dynamic Optimal: Run the benchmark at all frequency points and choose the lowest-power frequency for each execution phase

29 of 46

Policies for Comparison

  • Perfect Memoryless: For each execution interval, choose the chip frequency that minimized energy consumption in the previous interval
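
A toy rendering of the perfect-memoryless baseline as described above (names and the energy_of oracle are placeholders, not from the paper):

    def perfect_memoryless(intervals, freq_points, energy_of):
        # energy_of(interval, freq): oracle giving the energy that interval
        # would consume at the given chip frequency.
        choice = freq_points[0]      # arbitrary starting frequency
        schedule = []
        for interval in intervals:
            schedule.append(choice)
            # The next interval reuses whichever frequency was best for this one.
            choice = min(freq_points, key=lambda f: energy_of(interval, f))
        return schedule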

30 of 46

Results: Energy Reduction

Memory Intensive Benchmarks (Without Prefetching)

31 of 46

Results: Energy Reduction

Non-Memory Intensive Benchmarks (Without Prefetching)

32 of 46

Results: Energy Reduction

Prefetch-Heavy Benchmarks

33 of 46

Results: Energy Reduction

Prefetch-Light Benchmarks

34 of 46

Results

35 of 46

Results

All prefetch-heavy benchmarks lie above the y = x line

36 of 46

Conclusion

  • Previously proposed DVFS performance predictors:
    • Over-simplify the memory system
    • Are ineffective under realistic DRAM models
  • CRIT + BW realizes 65% of the potential energy savings

37 of 46

References

Figures, unless mentioned otherwise, have been taken from: Rustam Miftakhutdinov, Eiman Ebrahimi, and Yale N. Patt, "Predicting Performance Impact of DVFS for Realistic Memory Systems," MICRO 2012.

38 of 46

THANK YOU

39 of 46

Critical Points

  • No simulator is mentioned
  • Mean numbers used in Figures 9 and 11 are not mentioned

40 of 46

Current Methods: Leading Loads

  • If t = cycle time, Ccompute = number of compute cycles:
    • Tcompute = Ccompute * t, where Tcompute = time spent in compute instructions
  • Based on an abstract view of execution

44 of 46

Current Methods: Leading Loads

  • If Tmemory = Time spent in memory accesses, execution time (T) can be calculated as:

T = Ccompute * t + Tmemory
