1 of 21

Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach

by Bitirgen et al.

Presenter: Shashank (hegde031@)

2 of 21

Introduction

  • As the number of cores on a chip increases, so does the pressure on shared resources: cache, memory bandwidth, etc.
  • Unrestricted sharing of resources may lead to destructive interference
    • May slow down the whole system
  • Baseline: static fair-share allocation policy.
  • Resources on the chip are heavily dependent on each other.
  • The allocation process is mission critical
    • e.g., what resources should a workload like :() { : | : &}; : (a fork bomb) get?

3 of 21

Problem

Goals

  • How do we share resources in a multiprogrammed, multicore system?
  • Tuning one parameter affects the utilization of the others
    • Ex: increasing the cache allocation can impact memory bandwidth usage
    • Increase the power budget ⇒ run at higher frequency ⇒ ask for more memory BW.
  • High utilization
  • High performance
  • Small overhead: resource allocation itself should not drag down performance.

Proposed approach: ML

4 of 21

Tunable parameters/System Overview

  • 4-core processor (4 application workloads)
  • Shared last-level cache (L2): 16 ways
  • Off-chip bandwidth: 6.4 GB/s
  • Power budget: 60 W (tuned implicitly by varying operating frequency and voltage)

5 of 21

What already exists?

  • Uncoordinated systems
    • Cache partitioning
      • Specify cache boundaries per core
      • Dynamic partitioning: partition the cache at the granularity of ways per core/application
    • Bandwidth management: follows classic queuing-theory techniques
      • FCFS prioritizes ready, old commands → QoS/fairness issues for multiprogrammed workloads
      • FQMS (fair-queuing memory scheduling)
    • Power management
      • Highest-performing allocation policy: per-core DVFS → optimizes for throughput
  • Coordinated management
    • Prior work tunes (issue queue, ROB, register file) by measuring trial allocations → hill-climbing technique (see the sketch below)
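For context, a minimal sketch of hill climbing over an allocation, in the style of that coordinated prior work; the measure and neighbors hooks are hypothetical stand-ins, not the paper's interface:

    import random

    def hill_climb(alloc, measure, neighbors, trials=50):
        """Greedy search: try a neighboring allocation each trial and
        keep it only if measured performance improves."""
        best, best_perf = alloc, measure(alloc)
        for _ in range(trials):
            cand = random.choice(neighbors(best))
            perf = measure(cand)
            if perf > best_perf:
                best, best_perf = cand, perf
        return best

Because each candidate must actually run to be measured, this style of search pays in wall-clock time, which is why converging before a phase change becomes a problem (slide 17).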

6 of 21

Proposed Technique

  • Per-application performance model
  • Ensemble learning (predictions averaged across networks; see the sketch below)
  • 12-input feed-forward NNs with 4 hidden units; inputs include:
    • L2 cache space, off-chip bandwidth, and power budget allocated to the application
    • Number of read hits, read misses, write hits, and write misses over the last 20K instructions and over the last 1.5M instructions
    • Fraction of cache ways that are dirty (a proxy for write-back traffic)
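A minimal sketch of the per-application model as described on this slide: four small feed-forward nets whose predictions are averaged. Random weights stand in for trained ones, and the linear output unit is an assumption (the slides do not specify it):

    import numpy as np

    rng = np.random.default_rng(0)

    def forward(x, w_h, w_o):
        # 12 inputs (+ bias), 4 sigmoid hidden units, 1 output
        h = 1.0 / (1.0 + np.exp(-(w_h @ np.append(x, 1.0))))  # (12+1)*4 MACs
        return float(w_o @ np.append(h, 1.0))

    # One ensemble per application: 4 nets, prediction = their average
    nets = [(rng.normal(size=(4, 13)), rng.normal(size=5)) for _ in range(4)]
    x = rng.random(12)  # allocations + hit/miss history + dirty-way fraction
    ipc_pred = np.mean([forward(x, w_h, w_o) for w_h, w_o in nets])
    print(ipc_pred)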

7 of 21

View of Ensemble

8 of 21

Questions unaddressed so far

  1. What are we predicting? → IPC
  2. How is the training data generated?
    1. Run a workload for 800M instructions → run for 150M instructions under random resource allocations, collecting 300 sample points of IPC values (folds of 50 per ensemble network for training, 100 held out for test)
  3. How do we estimate the confidence of a choice? → CoV of the prediction error (from the ensemble; see the sketch below)
  4. What do the update and measurement intervals look like?
    • Make a resource allocation decision every 500,000 cycles using the trained per-application performance models
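A sketch of the confidence estimate. Here the CoV is taken over the individual networks' predictions as a proxy for the paper's CoV of prediction error, and the threshold value is illustrative, not from the paper:

    import numpy as np

    def predict_with_confidence(net_preds, cov_threshold=0.1):
        # Low disagreement across the ensemble -> trust the prediction
        preds = np.asarray(net_preds, dtype=float)
        mean = preds.mean()
        cov = preds.std() / abs(mean)          # coefficient of variation
        return mean, cov <= cov_threshold      # (predicted IPC, confident?)

    print(predict_with_confidence([1.10, 1.05, 1.12, 1.08]))  # confident
    print(predict_with_confidence([0.60, 1.40, 0.90, 1.90]))  # not confident

An allocation whose prediction comes back not confident would be discarded; slide 10 marks such subregions illegal for the next interval.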

9 of 21

Initial Resource Allocation and CoV estimates

  • 1 cache way guaranteed per application (leaving 12 of the 16 ways free to optimize; arithmetic checked below)
  • 800 MB/s guaranteed per application out of the 6.4 GB/s total (leaving 3.2 GB/s free)
  • 60 W overall budget shared among the applications
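The "freedom" figures follow directly from the guaranteed minimums; a quick check using the values on this slide:

    apps = 4
    ways_free = 16 - apps * 1           # 12 L2 ways left to optimize over
    bw_free_gbps = 6.4 - apps * 0.8     # 3.2 GB/s left to optimize over
    print(ways_free, bw_free_gbps)      # 12 3.2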

10 of 21

Defining the Search Space

  • Naive goal: exhaustively search all possible combinations of resource allocations.
  • Exhaustive search is really exhausting → 4 per-core allocations × 3 resource parameters ⇒ 1 billion+ combinations (see the sketch after this list)
  • Can we navigate a smaller portion of the space? → heuristic search algorithm
    • The probability of making at least one low-confidence prediction increases exponentially with the degree of multiprogramming
    • If even one application in the quad workload is predicted with high error, the combined prediction is unusable → full-space search is infeasible
    • Search specific subregions of the space, each covering a subset of applications, such that fair share is guaranteed → 11 subregions
  • If the performance prediction for an application is unreliable (high CoV, i.e. low confidence), mark that subregion illegal for the application for the next interval.
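A back-of-the-envelope count of the raw space. The 12 free L2 ways come from slide 9; the bandwidth and power quantizations (32 units each) are assumptions chosen for illustration:

    from math import comb

    def compositions(units, apps=4):
        # Ways to split `units` identical quanta among `apps` applications
        # (stars and bars; an application may receive zero extra quanta)
        return comb(units + apps - 1, apps - 1)

    cache = compositions(12)   # 12 free L2 ways (slide 9)
    bw    = compositions(32)   # assumed: 3.2 GB/s free BW in 100 MB/s quanta
    power = compositions(32)   # assumed: power budget split into 32 quanta

    print(cache * bw * power)  # ~1.9e10 -> well past "1 billion+"

Even with coarse quantization, exhaustive search is hopeless within a 500K-cycle decision interval.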

11 of 21

Update Cycle

Interval ~ 2,000 queries / 25,000 cycles

[Diagram: the global resource manager ("gResource Mgr") issues queries to the ANN ensemble; predictions with high error route to the random resource allocator.]

12 of 21

Sampling Training Data

  • Execution of previously unobserved code regions can render the ANNs' training stale
  • When sampling, distribute resources uniformly at random over the allocation space.
  • The training set consists of 300 sample points at all times.
  • New points are inserted via a random replacement policy (see the sketch below)
  • Random sampling under the fair-queuing memory controller can significantly degrade performance (applications may exceed their bandwidth quota)
    • Sampling while stalling request issue beyond the quota fixes this
    • Excessive sampling can itself degrade performance.
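A minimal sketch of the 300-point sample set with random replacement, as the slide describes (the class and method names are hypothetical):

    import random

    class SampleBuffer:
        """Fixed-size training set; full buffers evict a uniformly random
        entry, so the set drifts toward the application's current phase."""
        def __init__(self, capacity=300):
            self.capacity = capacity
            self.samples = []

        def insert(self, features, measured_ipc):
            sample = (features, measured_ipc)
            if len(self.samples) < self.capacity:
                self.samples.append(sample)
            else:
                # Random replacement: overwrite a uniformly chosen slot
                self.samples[random.randrange(self.capacity)] = sample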

13 of 21

Hardware implementation

  • 4 ensembles (one per application), 4 networks each → 16 ANNs in total
    • 12 inputs, 4 hidden units → 1 output
  • Implementing 16 ANNs in hardware is expensive.
    • Solution: 1 physical ANN, time-multiplexed to serve all 16
  • Some insight into how much math is done:
    • (12+1)×4 multiply-accumulate (MAC) units
    • The sigmoid function is stored as a 50-entry LUT → the weighted sum above indexes the table (see the sketch below)
  • A query's result is not ready until all 16 ANNs have finished processing.
  • All in all:
    • ~1.5% area overhead on a 200 mm² chip
    • 16 ANNs need (224×3) 16-bit fixed-point weights
    • 1.3 KB of storage for weights and 7.6 KB to store the 300 sample points
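A sketch of the 50-entry sigmoid LUT: the MAC array's weighted sum indexes the table instead of evaluating exp. The input range [-6, 6] is an assumption; the slide gives only the entry count:

    import numpy as np

    # 50-entry sigmoid table over an assumed input range of [-6, 6]
    LUT = 1.0 / (1.0 + np.exp(-np.linspace(-6.0, 6.0, 50)))

    def sigmoid_lut(s):
        # Quantize the MAC sum to the nearest table entry (no exp in hardware)
        idx = int(round((np.clip(s, -6.0, 6.0) + 6.0) / 12.0 * 49))
        return float(LUT[idx])

    print(sigmoid_lut(0.0), 1 / (1 + np.exp(0.0)))  # LUT value vs. exact 0.5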

14 of 21

Metrics of Evaluation

  • Sum of IPCs
    • Works for homogeneous and/or latency-insensitive workloads
  • Weighted speedup (also used for choosing workloads)
    • For latency-sensitive workloads
  • Harmonic mean of normalized IPCs
    • Strikes a balance between system throughput and fairness
  • Weighted sum of IPCs
    • Assigns priorities to a mix of workloads: the OS assigns weights in [1, 4] (all four metrics are computed in the sketch below)
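The four metrics as a sketch in code, where ipc_alone is each application's IPC when running alone (used for normalization):

    import numpy as np

    def evaluate(ipc, ipc_alone, weights=None):
        """Per-application shared-mode IPCs vs. alone-mode IPCs."""
        ipc, ipc_alone = np.asarray(ipc, float), np.asarray(ipc_alone, float)
        norm = ipc / ipc_alone                     # normalized IPCs
        out = {
            "sum_of_ipcs": ipc.sum(),
            "weighted_speedup": norm.sum(),
            "harmonic_mean_of_norm_ipcs": len(ipc) / (1.0 / norm).sum(),
        }
        if weights is not None:                    # OS priorities in [1, 4]
            out["weighted_sum_of_ipcs"] = (np.asarray(weights) * ipc).sum()
        return out

    print(evaluate(ipc=[1.2, 0.8, 1.5, 0.5],
                   ipc_alone=[1.5, 1.0, 1.6, 1.0],
                   weights=[1, 4, 2, 1]))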

15 of 21

Workload Construction

  • Cache-sensitive (L2min-alloc < working-set requirement < L2 size)
  • Memory-bound (requirement > L2 size)
  • Compute-bound (requirement < L2min-alloc)
  • Total of 46 workloads; 9 picked (by weighted speedup under static fair-share allocation)
  • Fast-forward 800M instructions
  • Warm up the ANNs → 200M instructions
  • Run for another 200M instructions and measure performance

16 of 21

Results

  • The static fair-share allocation policy is the baseline
  • 9 resource-management configurations are compared against this approach:
    • 1 unmanaged
    • 3 isolated (one resource each)
    • 3 pairwise uncoordinated
    • 1 triplet uncoordinated
    • 1 triplet coordinated, hill-climbing technique (not over the same parameters*)

17 of 21

Inference

  • Uncoordinated management of 2 or 3 resources in several cases degrades performance compared to managing just 1
    • Uncoordinated resource management is not additive ⇒ can cause destructive interference
  • Coordinated hill climbing generally outperforms uncoordinated 3-resource management (basically a too-many-cooks issue)
    • Still inferior to fair share
      • Program phases change before the algorithm converges
  • The ML-based model outperforms everything, as expected
    • 14% speedup over fair share
  • The fraction of time an application spends in resource-allocation optimization depends both on the application itself and on its co-runners ⇒ resource isolation is imperfect.

18 of 21

Impact of CoV

  • A lower CoV threshold defaults to fair-share allocation (conservative decision making)
  • A higher CoV threshold is over-ambitious: low-confidence predictions get acted on, tending to yield misguided allocations

19 of 21

Intervals spent on Allocation optimization

  • C (cache-sensitive) and M (memory-bound) applications engage in optimization more frequently
    • Cache needs keep changing from one execution phase to another
    • The same applies to memory bandwidth needs
  • Why don't P (compute-bound) applications engage much?
    • Perhaps because memory transfers are usually offloaded (MMU/DMA), so the CPU can continue executing another thread?

20 of 21

Conclusion - High points

  • Training is done online, not on a fixed set of input configurations derived for a fixed set of workloads
  • It is cool that this approach facilitates dynamic allocation and training (letting applications undergo phase changes).
  • For a 14% improvement over the baseline, the hardware overhead is nominal
  • Evaluation uses cycle-accurate simulation

21 of 21

Worth considering

  • Time-series plots of overall and per-application performance and resource utilization could communicate how these models perform and at what stages they come under pressure.
  • Could an LSTM-based approach help make better decisions, since we are effectively doing time-series prediction?
    • What about classic control-system techniques for maximizing perf/utilization → PID loops, Kalman filters, etc.?
  • The update interval depends on the kind of workload (P, C, M); could that hinder allocation if there were many context switches (i.e., more applications running per core)?
    • NumANNs = 4 × NumApplications → the authors assume NumCores > NumApplications. Is this always true, especially in today's containerized execution world?
  • The authors compare IPCs; could hardware resource-utilization metrics also strengthen the evaluation?
  • One ensemble per application is a lot of hardware
    • It depends on the assumption that the number of applications running on the CMP will rise slowly compared to the number of cores
  • Where would control groups (cgroups) fit in this scenario?