1 of 21

Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach

by Bitirgen et al.

Presenter: Shashank (hegde031@)

2 of 21

Introduction

  • As the number of cores on a chip increases, so does the pressure on shared resources: cache, memory bandwidth, etc.
  • Unrestricted sharing of resources may lead to destructive interference
    • May slow down the whole system
  • Baseline: static fair-share allocation policy.
  • Resources on the chip are heavily dependent on each other.
  • The allocation process is mission critical
    • e.g., what resources should a workload like :() { : | : &}; : (a fork bomb) get?

3 of 21

Problem

Goals

  • How do we share resources in a multiprogrammed, multicore system?
  • Tuning one parameter affects the utilization of the others
    • Ex: increasing the cache allocation can impact memory bandwidth usage
    • Increase the power budget ⇒ run at higher frequency ⇒ ask for more memory BW.
  • High utilization
  • High performance
  • Small overhead: resource allocation itself should not drag down performance.

Proposed approach: ML

4 of 21

Tunable parameters/System Overview

  • 4-core processor (4 application workloads)
  • Shared last-level cache (L2): 16 ways
  • Off-chip bandwidth: 6.4 GB/s
  • Power budget: 60 W (tuned implicitly by varying operating frequency and voltage)

5 of 21

What already exists?

  • Uncoordinated systems
    • Cache partitioning
      • Specify cache boundaries per core
      • Dynamic partitioning: partition the cache at the granularity of ways per core/application
    • Bandwidth management: follows classic queuing-theory techniques
      • FCFS prioritizes ready, old commands → QoS/fairness issues for multiprogrammed workloads
      • FQMS (fair-queuing memory scheduling)
    • Power management
      • Highest-performing allocation policy: per-core DVFS → optimizes for throughput
  • Coordinated management
    • Prior work tunes (issue queue, ROB, register file) by measuring trial allocations → hill-climbing technique (see the sketch below)
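For context, a minimal sketch of hill climbing over an allocation, in the style of that coordinated prior work; the measure and neighbors hooks are hypothetical stand-ins, not the paper's interface:

    import random

    def hill_climb(alloc, measure, neighbors, trials=50):
        """Greedy search: try a neighboring allocation each trial and
        keep it only if measured performance improves."""
        best, best_perf = alloc, measure(alloc)
        for _ in range(trials):
            cand = random.choice(neighbors(best))
            perf = measure(cand)
            if perf > best_perf:
                best, best_perf = cand, perf
        return best

Because each candidate must actually run to be measured, this style of search pays in wall-clock time, which is why converging before a phase change becomes a problem (slide 17).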

6 of 21

Proposed Technique

  • Per-application performance model
  • Ensemble learning (predictions averaged across networks; see the sketch below)
  • 12-input feed-forward NNs with 4 hidden units; inputs include:
    • L2 cache space, off-chip bandwidth, and power budget allocated to the application
    • Number of read hits, read misses, write hits, and write misses over the last 20K instructions and over the last 1.5M instructions
    • Fraction of cache ways that are dirty (a proxy for write-back traffic)
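A minimal sketch of the per-application model as described on this slide: four small feed-forward nets whose predictions are averaged. Random weights stand in for trained ones, and the linear output unit is an assumption (the slides do not specify it):

    import numpy as np

    rng = np.random.default_rng(0)

    def forward(x, w_h, w_o):
        # 12 inputs (+ bias), 4 sigmoid hidden units, 1 output
        h = 1.0 / (1.0 + np.exp(-(w_h @ np.append(x, 1.0))))  # (12+1)*4 MACs
        return float(w_o @ np.append(h, 1.0))

    # One ensemble per application: 4 nets, prediction = their average
    nets = [(rng.normal(size=(4, 13)), rng.normal(size=5)) for _ in range(4)]
    x = rng.random(12)  # allocations + hit/miss history + dirty-way fraction
    ipc_pred = np.mean([forward(x, w_h, w_o) for w_h, w_o in nets])
    print(ipc_pred)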

7 of 21

View of Ensemble

8 of 21

Questions unaddressed so far

  1. What are we predicting? → IPC
  2. How is the training data generated?
    1. Run a workload for 800M instructions → run for 150M instructions under random resource allocations, collecting 300 sample points of IPC values (folds of 50 per ensemble network for training, 100 held out for test)
  3. How do we estimate the confidence of a choice? → CoV of the prediction error (from the ensemble; see the sketch below)
  4. What do the update and measurement intervals look like?
    • Make a resource allocation decision every 500,000 cycles using the trained per-application performance models
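A sketch of the confidence estimate. Here the CoV is taken over the individual networks' predictions as a proxy for the paper's CoV of prediction error, and the threshold value is illustrative, not from the paper:

    import numpy as np

    def predict_with_confidence(net_preds, cov_threshold=0.1):
        # Low disagreement across the ensemble -> trust the prediction
        preds = np.asarray(net_preds, dtype=float)
        mean = preds.mean()
        cov = preds.std() / abs(mean)          # coefficient of variation
        return mean, cov <= cov_threshold      # (predicted IPC, confident?)

    print(predict_with_confidence([1.10, 1.05, 1.12, 1.08]))  # confident
    print(predict_with_confidence([0.60, 1.40, 0.90, 1.90]))  # not confident

An allocation whose prediction comes back not confident would be discarded; slide 10 marks such subregions illegal for the next interval.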

9 of 21

Initial Resource Allocation and CoV estimates

  • 1 cache way guaranteed per application (leaving 12 of the 16 ways free to optimize; arithmetic checked below)
  • 800 MB/s guaranteed per application out of the 6.4 GB/s total (leaving 3.2 GB/s free)
  • 60 W overall budget shared among the applications
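The "freedom" figures follow directly from the guaranteed minimums; a quick check using the values on this slide:

    apps = 4
    ways_free = 16 - apps * 1           # 12 L2 ways left to optimize over
    bw_free_gbps = 6.4 - apps * 0.8     # 3.2 GB/s left to optimize over
    print(ways_free, bw_free_gbps)      # 12 3.2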

10 of 21

Defining the Search Space

  • Naive goal: exhaustively search all possible combinations of resource allocations.
  • Exhaustive search is really exhausting → 4 per-core allocations × 3 resource parameters ⇒ 1 billion+ combinations (see the sketch after this list)
  • Can we navigate a smaller portion of the space? → heuristic search algorithm
    • The probability of making at least one low-confidence prediction increases exponentially with the degree of multiprogramming
    • If even one application in the quad workload is predicted with high error, the combined prediction is unusable → full-space search is infeasible
    • Search specific subregions of the space, each covering a subset of applications, such that fair share is guaranteed → 11 subregions
  • If the performance prediction for an application is unreliable (high CoV, i.e. low confidence), mark that subregion illegal for the application for the next interval.
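A back-of-the-envelope count of the raw space. The 12 free L2 ways come from slide 9; the bandwidth and power quantizations (32 units each) are assumptions chosen for illustration:

    from math import comb

    def compositions(units, apps=4):
        # Ways to split `units` identical quanta among `apps` applications
        # (stars and bars; an application may receive zero extra quanta)
        return comb(units + apps - 1, apps - 1)

    cache = compositions(12)   # 12 free L2 ways (slide 9)
    bw    = compositions(32)   # assumed: 3.2 GB/s free BW in 100 MB/s quanta
    power = compositions(32)   # assumed: power budget split into 32 quanta

    print(cache * bw * power)  # ~1.9e10 -> well past "1 billion+"

Even with coarse quantization, exhaustive search is hopeless within a 500K-cycle decision interval.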

11 of 21

Update Cycle

Interval ~ 2,000 queries / 25,000 cycles

[Diagram: the global resource manager ("gResource Mgr") issues queries to the ANN ensemble; predictions with high error route to the random resource allocator.]

12 of 21

Sampling Training Data

  • Execution of previously unobserved code regions can render the ANNs' training stale
  • When sampling, distribute resources uniformly at random over the allocation space.
  • The training set consists of 300 sample points at all times.
  • New points are inserted via a random replacement policy (see the sketch below)
  • Random sampling under the fair-queuing memory controller can significantly degrade performance (applications may exceed their bandwidth quota)
    • Sampling while stalling request issue beyond the quota fixes this
    • Excessive sampling can itself degrade performance.
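A minimal sketch of the 300-point sample set with random replacement, as the slide describes (the class and method names are hypothetical):

    import random

    class SampleBuffer:
        """Fixed-size training set; full buffers evict a uniformly random
        entry, so the set drifts toward the application's current phase."""
        def __init__(self, capacity=300):
            self.capacity = capacity
            self.samples = []

        def insert(self, features, measured_ipc):
            sample = (features, measured_ipc)
            if len(self.samples) < self.capacity:
                self.samples.append(sample)
            else:
                # Random replacement: overwrite a uniformly chosen slot
                self.samples[random.randrange(self.capacity)] = sample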

13 of 21

Hardware implementation

  • 4 ensembles (one per application), 4 networks each → 16 ANNs in total
    • 12 inputs, 4 hidden units → 1 output
  • Implementing 16 ANNs in hardware is expensive.
    • Solution: 1 physical ANN, time-multiplexed to serve all 16
  • Some insight into how much math is done:
    • (12+1)×4 multiply-accumulate (MAC) units
    • The sigmoid function is stored as a 50-entry LUT → the weighted sum above indexes the table (see the sketch below)
  • A query's result is not ready until all 16 ANNs have finished processing.
  • All in all:
    • ~1.5% area overhead on a 200 mm² chip
    • 16 ANNs need (224×3) 16-bit fixed-point weights
    • 1.3 KB of storage for weights and 7.6 KB to store the 300 sample points
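A sketch of the 50-entry sigmoid LUT: the MAC array's weighted sum indexes the table instead of evaluating exp. The input range [-6, 6] is an assumption; the slide gives only the entry count:

    import numpy as np

    # 50-entry sigmoid table over an assumed input range of [-6, 6]
    LUT = 1.0 / (1.0 + np.exp(-np.linspace(-6.0, 6.0, 50)))

    def sigmoid_lut(s):
        # Quantize the MAC sum to the nearest table entry (no exp in hardware)
        idx = int(round((np.clip(s, -6.0, 6.0) + 6.0) / 12.0 * 49))
        return float(LUT[idx])

    print(sigmoid_lut(0.0), 1 / (1 + np.exp(0.0)))  # LUT value vs. exact 0.5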

14 of 21

Metrics of Evaluation

  • Sum of IPCs
    • Works for homogeneous and/or latency-insensitive workloads
  • Weighted speedup (also used for choosing workloads)
    • For latency-sensitive workloads
  • Harmonic mean of normalized IPCs
    • Strikes a balance between system throughput and fairness
  • Weighted sum of IPCs
    • Assigns priorities to a mix of workloads: the OS assigns weights in [1, 4] (all four metrics are computed in the sketch below)
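The four metrics as a sketch in code, where ipc_alone is each application's IPC when running alone (used for normalization):

    import numpy as np

    def evaluate(ipc, ipc_alone, weights=None):
        """Per-application shared-mode IPCs vs. alone-mode IPCs."""
        ipc, ipc_alone = np.asarray(ipc, float), np.asarray(ipc_alone, float)
        norm = ipc / ipc_alone                     # normalized IPCs
        out = {
            "sum_of_ipcs": ipc.sum(),
            "weighted_speedup": norm.sum(),
            "harmonic_mean_of_norm_ipcs": len(ipc) / (1.0 / norm).sum(),
        }
        if weights is not None:                    # OS priorities in [1, 4]
            out["weighted_sum_of_ipcs"] = (np.asarray(weights) * ipc).sum()
        return out

    print(evaluate(ipc=[1.2, 0.8, 1.5, 0.5],
                   ipc_alone=[1.5, 1.0, 1.6, 1.0],
                   weights=[1, 4, 2, 1]))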

15 of 21

Workload Construction

  • Cache-sensitive (L2min-alloc < working-set requirement < L2 size)
  • Memory-bound (requirement > L2 size)
  • Compute-bound (requirement < L2min-alloc)
  • Total of 46 workloads; 9 picked (by weighted speedup under static fair-share allocation)
  • Fast-forward 800M instructions
  • Warm up the ANNs → 200M instructions
  • Run for another 200M instructions and measure performance

16 of 21

Results

  • The static fair-share allocation policy is the baseline
  • 9 resource-management configurations are compared against this approach:
    • 1 unmanaged
    • 3 isolated (one resource each)
    • 3 pairwise uncoordinated
    • 1 triplet uncoordinated
    • 1 triplet coordinated, hill-climbing technique (not over the same parameters*)

17 of 21

Inference

  • Uncoordinated management of 2 or 3 resources in several cases degrades performance compared to managing just 1
    • Uncoordinated resource management is not additive ⇒ can cause destructive interference
  • Coordinated hill climbing generally outperforms uncoordinated 3-resource management (basically a too-many-cooks issue)
    • Still inferior to fair share
      • Program phases change before the algorithm converges
  • The ML-based model outperforms everything, as expected
    • 14% speedup over fair share
  • The fraction of time an application spends in resource-allocation optimization depends both on the application itself and on its co-runners ⇒ resource isolation is imperfect.

18 of 21

Impact of CoV

  • A lower CoV threshold defaults to fair-share allocation (conservative decision making)
  • A higher CoV threshold is over-ambitious: low-confidence predictions get acted on, tending to yield misguided allocations

19 of 21

Intervals spent on Allocation optimization

  • C (cache-sensitive) and M (memory-bound) applications engage in optimization more frequently
    • Cache needs keep changing from one execution phase to another
    • The same applies to memory bandwidth needs
  • Why don't P (compute-bound) applications engage much?
    • Perhaps because memory transfers are usually offloaded (MMU/DMA), so the CPU can continue executing another thread?

20 of 21

Conclusion - High points

  • Training is done online, not on a fixed set of input configurations derived for a fixed set of workloads
  • It is cool that this approach facilitates dynamic allocation and training (letting applications undergo phase changes).
  • For a 14% improvement over the baseline, the hardware overhead is nominal
  • Evaluation uses cycle-accurate simulation

21 of 21

Worth considering

  • Time-series plots of overall and per-application performance and resource utilization could communicate how these models perform and at what stages they come under pressure.
  • Could an LSTM-based approach help make better decisions, since we are effectively doing time-series prediction?
    • What about classic control-system techniques for maximizing perf/utilization → PID loops, Kalman filters, etc.?
  • The update interval depends on the kind of workload (P, C, M); could that hinder allocation if there were many context switches (i.e., more applications running per core)?
    • NumANNs = 4 × NumApplications → the authors assume NumCores > NumApplications. Is this always true, especially in today's containerized execution world?
  • The authors compare IPCs; could hardware resource-utilization metrics also strengthen the evaluation?
  • One ensemble per application is a lot of hardware
    • It depends on the assumption that the number of applications running on the CMP will rise slowly compared to the number of cores
  • Where would control groups (cgroups) fit in this scenario?