1 of 46

CS-773 Paper Presentation

Predicting Performance Impact of DVFS for Realistic Memory Systems

Abhinav Sridhar

The Booster Dose (#1)

abhinavsridhar22@iitb.ac.in

2 of 46

DVFS

  • Dynamic Voltage Frequency Scaling
  • Why run the processor at full throttle when memory is slowing you down?

3 of 46

DVFS

  • Two ways to fill the stall cycles:
    • Increase independent instructions
    • Slow down the processor

Image Reference: Slides by Prof. Biswabandan Panda

4 of 46

Two Phase View

  • Two major assumptions:
    • All memory accesses have the same latency
    • On a miss, once the independent instructions have executed, the processor stalls until the access is serviced

5 of 46

Current Methods: Leading Loads

Tmemory → memory access latency

Ccompute → # compute cycles

t → cycle time

T = Ccompute * t + Tmemory
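
A minimal sketch of how this linear model can be applied (my illustration, not the paper's implementation; counter values below are hypothetical), assuming hardware counters expose Ccompute and Tmemory for an interval:

    # Leading-loads style prediction: compute time scales with the clock,
    # memory time (measured from leading loads) is assumed frequency-independent.
    def predict_time(c_compute, t_memory_ns, cycle_time_ns):
        return c_compute * cycle_time_ns + t_memory_ns

    # Hypothetical counter readings for one interval.
    c_compute = 4_000_000      # compute cycles
    t_memory_ns = 1_500_000    # leading-load memory time in nanoseconds

    # Predicted execution time at a few candidate core frequencies.
    for freq_ghz in (1.0, 2.0, 3.0):
        cycle_ns = 1.0 / freq_ghz
        print(f"{freq_ghz} GHz -> {predict_time(c_compute, t_memory_ns, cycle_ns):,.0f} ns")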

6 of 46

Current Methods: Stall Time

  • Counter-based architecture
  • Estimates execution time from the stall time experienced by the processor
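
A rough sketch of the stall-time style of prediction (my illustration, not the paper's hardware), assuming a counter that accumulates cycles in which the core is stalled on memory:

    # Stall-time model: stall time is assumed fixed in absolute time,
    # while the remaining (non-stall) cycles rescale with the new clock.
    def predict_time_stall(total_cycles, stall_cycles, cur_cycle_ns, new_cycle_ns):
        stall_time_ns = stall_cycles * cur_cycle_ns
        nonstall_cycles = total_cycles - stall_cycles
        return stall_time_ns + nonstall_cycles * new_cycle_ns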

7 of 46

Shortcomings: DRAM

  • DRAM access latency can be variable due to:
    • Row hits
    • Bank conflicts and bank level parallelism
    • Variable time spent in memory queue

8 of 46

Shortcomings: Prefetching

  • Prefetching issues multiple main memory accesses for data to be used in the future
  • Significantly increases memory bandwidth demand

9 of 46

Potential for Improvement

STATE-OF-THE-ART DVFS CONTROLLERS CRUMBLE UNDER REALISTIC DRAM MODELS AS WELL AS PREFETCHERS

10 of 46

Realistic Execution Sequence

  • Not all DRAM accesses have equal latencies
  • Independent instructions continue to fill the ROB during a long latency miss

11 of 46

CRIT: Critical Path Calculation

  • Assumption: if two memory requests are serialized, they can be treated as a dependency chain

12 of 46

CRIT: Critical Path Calculation

  • Pglobal : Global Critical Path Counter
  • Initially, Pglobal = 0
  • Pi : Pglobal at initiation of ith request
  • After completion of the ith load, update Pglobal as Pglobal = max(Pglobal, Pi + ΔT), where ΔT is the latency of that load
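
A compact sketch of this bookkeeping (illustrative Python, not the paper's hardware structures):

    class CritPathCounter:
        """Tracks Pglobal, the critical path of off-chip memory time."""

        def __init__(self):
            self.p_global = 0      # Pglobal, initially 0
            self.p_at_init = {}    # Pi: Pglobal snapshot when request i was initiated

        def on_initiate(self, req_id):
            self.p_at_init[req_id] = self.p_global

        def on_complete(self, req_id, latency):
            # Pglobal = max(Pglobal, Pi + ΔT), where ΔT is this request's latency
            self.p_global = max(self.p_global, self.p_at_init.pop(req_id) + latency)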

13 of 46

CRIT: Critical Path Calculation

  • When Load A & Load B are initiated, set PA = PB = 0, since Pglobal = 0

14 of 46

CRIT: Critical Path Calculation

  • When Load A finishes, set Pglobal = max(0, A) = A

15 of 46

CRIT: Critical Path Calculation

  • When Load C is initiated, set PC = A, since Pglobal = A

16 of 46

CRIT: Critical Path Calculation

  • When Load B is completed, Pglobal = max(A, B) = B

17 of 46

CRIT: Critical Path Calculation

  • When Load C is completed, Pglobal = max(B, A+C) = A+C

18 of 46

CRIT: Critical Path Calculation

  • Similarly when Load D and Load E are initiated, PD = PE = A+C

19 of 46

CRIT: Critical Path Calculation

  • When Load D is completed, Pglobal = A+C+D
  • When Load E is completed, Pglobal = max(A+C+D, A+C+E) = A+C+E (E > D in this example)
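
Replaying the Load A–E walkthrough above with the CritPathCounter sketch from slide 12 (latencies are made-up values chosen so that B > A, A+C > B and E > D, matching the example):

    cp = CritPathCounter()
    A, B, C, D, E = 100, 120, 60, 40, 50       # hypothetical latencies (cycles)

    cp.on_initiate("A"); cp.on_initiate("B")   # PA = PB = 0
    cp.on_complete("A", A)                     # Pglobal = A
    cp.on_initiate("C")                        # PC = A
    cp.on_complete("B", B)                     # Pglobal = max(A, B) = B
    cp.on_complete("C", C)                     # Pglobal = max(B, A+C) = A+C
    cp.on_initiate("D"); cp.on_initiate("E")   # PD = PE = A+C
    cp.on_complete("D", D)                     # Pglobal = A+C+D
    cp.on_complete("E", E)                     # Pglobal = A+C+E
    print(cp.p_global)                         # 210 = A + C + E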

20 of 46

CRIT: Critical Path Calculation

  • Stores and write-backs are ignored because they do not stall the processor

21 of 46

Effects of Prefetching

  • When prefetching is introduced, performance saturates as frequency increases
  • As DRAM latency remains constant, decreasing the compute period beyond a point will not result in better performance

22 of 46

Limited Bandwidth Performance Model

  • Frequency of operation is divided into 2 zones:
    • A low-frequency range, where DRAM can service prefetch requests before the chip stalls
    • A high-frequency range, where DRAM cannot service prefetch requests before the chip stalls

23 of 46

Limited Bandwidth Performance Model

T = Tmin memory ; t < tcrossover
T = Tdemand + t * Ccompute ; elsewhere
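
A small sketch of this piecewise model (my rendering; Tmin memory, Tdemand and Ccompute are assumed to come from the predictor's counters, and tcrossover is taken as the cycle time where the two branches meet):

    def predict_time_limited_bw(t_cycle, t_min_memory, t_demand, c_compute):
        # Cycle time at which the bandwidth-bound and compute-bound branches intersect.
        t_crossover = (t_min_memory - t_demand) / c_compute
        if t_cycle < t_crossover:
            return t_min_memory                  # high frequency: DRAM bandwidth bound
        return t_demand + t_cycle * c_compute    # low frequency: leading-loads style term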

24 of 46

DRAM Slack

  • Tmin memory = Tmemory access - Tmemory slack
  • Tmemory slack is defined as extra time spent by DRAM so that timing constraints are not violated

25 of 46

Hardware Overheads

26 of 46

Experimental Methodology

  • No Simulator mentioned

27 of 46

Policies for Comparison

  • Static Optimal: Run the benchmark at all frequency points and choose the one with the lowest power consumption

28 of 46

Policies for Comparison

  • Dynamic Optimal: Run the benchmark at all frequency points and choose the lowest-power frequency for each execution phase

29 of 46

Policies for Comparison

  • Perfect Memoryless: For each execution interval, choose the chip frequency that minimized energy consumption in the previous interval
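
A toy rendering of the perfect-memoryless baseline as described above (names and the energy_of oracle are placeholders, not from the paper):

    def perfect_memoryless(intervals, freq_points, energy_of):
        # energy_of(interval, freq): oracle giving the energy that interval
        # would consume at the given chip frequency.
        choice = freq_points[0]      # arbitrary starting frequency
        schedule = []
        for interval in intervals:
            schedule.append(choice)
            # The next interval reuses whichever frequency was best for this one.
            choice = min(freq_points, key=lambda f: energy_of(interval, f))
        return schedule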

30 of 46

Results: Energy Reduction

Memory Intensive Benchmarks (Without Prefetching)

31 of 46

Results: Energy Reduction

Non-Memory Intensive Benchmarks (Without Prefetching)

32 of 46

Results: Energy Reduction

Prefetch-Heavy Benchmarks

33 of 46

Results: Energy Reduction

Prefetch-Light Benchmarks

34 of 46

Results

35 of 46

Results

All prefetch-heavy benchmarks lie above the y = x line

36 of 46

Conclusion

  • Previously proposed DVFS performance predictors:
    • Over-simplify the memory system
    • Are ineffective under realistic DRAM models
  • CRIT + BW realizes 65% of the potential energy savings

37 of 46

References

Figures, unless mentioned otherwise, have been taken from: Rustam Miftakhutdinov, Eiman Ebrahimi, and Yale N. Patt, "Predicting Performance Impact of DVFS for Realistic Memory Systems," MICRO 2012.

38 of 46

THANK YOU

39 of 46

Critical Points

  • No simulator is mentioned
  • Mean numbers used in Figures 9 and 11 are not mentioned

40 of 46

Current Methods: Leading Loads

  • If t = cycle time, Ccompute = number of compute cycles:
    • Tcompute = Ccompute * t, where Tcompute = time spent in compute instructions
  • Based on an abstract view of execution

44 of 46

Current Methods: Leading Loads

  • If Tmemory = Time spent in memory accesses, execution time (T) can be calculated as:

T = Ccompute * t + Tmemory
